Skip to content

ESG Data Pipeline

This page documents the complete dbt model DAG (directed acyclic graph) for the ESG reporting pipeline, including detailed specifications for every model from staging through to the final mart layer.

Architecture Overview

The ESG pipeline follows the standard dbt layered architecture: staging models normalize raw source data, intermediate models apply business logic and statistical methods, and mart models produce the final fact tables and API-ready views.

flowchart LR
    subgraph Sources
        S1["nebula.visits"]
        S2["crawlers_data.ip_metadata"]
        S3["crawlers_data.ember_electricity"]
        S4["consensus.validators"]
    end

    subgraph Staging
        ST1["stg_crawlers_data__country_codes"]
        ST2["stg_crawlers_data__ember_electricity_data"]
    end

    subgraph Population Estimation
        PE["int_esg_node_population_chao1"]
    end

    subgraph Node Enrichment
        NE1["int_esg_node_classification"]
        NE2["int_esg_node_geographic_distribution"]
        NE3["int_esg_node_client_distribution"]
    end

    subgraph Energy & Carbon
        EC1["int_esg_dynamic_power_consumption"]
        EC2["int_esg_carbon_intensity_ensemble"]
    end

    subgraph Output
        O1["fct_esg_carbon_footprint_uncertainty"]
        O2["api_esg_* (9 models)"]
    end

    S1 --> PE
    S2 --> NE2
    S3 --> ST2
    S4 --> PE

    ST2 --> EC2

    PE --> NE1
    PE --> NE2
    PE --> NE3

    NE1 --> EC1
    NE3 --> EC1

    EC1 --> O1
    EC2 --> O1
    NE2 --> O1

    O1 --> O2

Incremental Materialization

All intermediate and mart models are incrementally materialized and partitioned by date. On each dbt run, only new or updated date partitions are processed. dbt tests validate row counts, not-null constraints, and cross-model referential integrity.


Staging Layer

stg_crawlers_data__country_codes

Country code reference table providing standardized ISO 3166-1 mappings.

Model Specification

Purpose: Normalize country identifiers across all upstream data sources (Ember, IP geolocation, manual mappings) to a consistent ISO 3166-1 alpha-2 standard.

Inputs:

  • crawlers_data.country_codes (seed/reference table)

Key Transformations:

  • Standardize country name variations (e.g., "USA", "United States", "US" all map to US)
  • Map ISO alpha-3 codes to alpha-2
  • Handle disputed territories and historical country codes

Output Columns:

Column Type Description
country_code String ISO 3166-1 alpha-2 code
country_name String Standardized English country name
continent String Continent classification
region String Sub-continental region

stg_crawlers_data__ember_electricity_data

Ingests and normalizes raw Ember Global Electricity Review data.

Model Specification

Purpose: Transform raw Ember CSV data into a structured, query-ready format with standardized country codes and pivoted generation shares.

Inputs:

  • crawlers_data.ember_electricity (raw CSV ingestion)
  • stg_crawlers_data__country_codes (country standardization)

Key Transformations:

  • Parse Ember CSV columns (country, year, month, generation type, TWh)
  • Standardize country names to ISO 3166-1 alpha-2 via country codes reference
  • Pivot generation data from wide format (one column per fuel type) to long format
  • Calculate percentage shares per generation type per country-month
  • Filter to the most recent available month per country

Output Columns:

Column Type Description
country_code String ISO 3166-1 alpha-2 code
date Date First day of the month
generation_type String Fuel/source type (coal, gas, solar, etc.)
generation_twh Float64 Generation in terawatt-hours
share_pct Float64 Percentage share of total generation

Intermediate Layer

int_esg_node_population_chao1

Estimates the total node population (including hidden nodes) using the Chao-1 ecological estimator.

Model Specification

Purpose: Apply the Chao-1 nonparametric estimator to Nebula crawler observations to estimate the true network size, including nodes not directly observed.

Inputs:

  • nebula.visits (crawler observation records)
  • consensus.validators (active validator set)

Key Transformations:

  • Aggregate daily unique node observations from Nebula crawl sessions
  • Compute observation frequency histogram: \(f_1\) (nodes seen exactly once), \(f_2\) (nodes seen exactly twice)
  • Apply Chao-1 estimator: \(\hat{N} = S_{obs} + \frac{f_1^2}{2 f_2}\)
  • Compute standard error and confidence intervals using Chao's variance formula
  • Cross-reference with consensus layer validator count for validation

Output Columns:

Column Type Description
date Date Observation date
nodes_observed UInt32 Directly observed unique nodes (\(S_{obs}\))
f1 UInt32 Nodes observed exactly once
f2 UInt32 Nodes observed exactly twice
nodes_estimated Float64 Chao-1 population estimate (\(\hat{N}\))
nodes_hidden Float64 Estimated hidden nodes (\(\hat{N} - S_{obs}\))
std_error Float64 Standard error of the estimate
ci_lower_95 Float64 95% CI lower bound
ci_upper_95 Float64 95% CI upper bound

int_esg_node_classification

Classifies nodes into operational categories (Home, Professional, Cloud) based on observed characteristics.

Model Specification

Purpose: Assign each observed node to a category that determines its power consumption profile and PUE.

Inputs:

  • int_esg_node_population_chao1 (estimated node population)
  • Hosting provider and ASN reference data

Key Transformations:

  • Match node IP addresses against known cloud provider ASN ranges (AWS, GCP, Azure, Hetzner, OVH, etc.)
  • Classify remaining nodes as Professional or Home based on ISP type, uptime patterns, and port configurations
  • Apply classification proportions to hidden node estimate (assumes hidden nodes follow same distribution)
  • Compute daily category counts with uncertainty

Output Columns:

Column Type Description
date Date Observation date
category String Node category: home, professional, cloud
nodes_observed UInt32 Observed nodes in this category
nodes_estimated Float64 Estimated total (including hidden)
fraction Float64 Category share of total network
classification_confidence Float64 Confidence score for classification

int_esg_node_geographic_distribution

Maps nodes to countries using IP geolocation data.

Model Specification

Purpose: Determine the geographic distribution of nodes across countries to enable country-level carbon intensity weighting.

Inputs:

  • int_esg_node_population_chao1 (estimated node population)
  • crawlers_data.ip_metadata (IP geolocation database)
  • stg_crawlers_data__country_codes (country standardization)

Key Transformations:

  • Join node IPs to geolocation database for country-level resolution
  • Flag nodes using known VPN/proxy services (reduces misattribution)
  • Apply geographic distribution proportions to hidden node estimates
  • Aggregate to daily country-level node counts
  • Compute geographic concentration metrics (Herfindahl index)

Output Columns:

Column Type Description
date Date Observation date
country_code String ISO 3166-1 alpha-2 code
country_name String Country name
continent String Continent
nodes_observed UInt32 Observed nodes in country
nodes_estimated Float64 Estimated total in country
share_pct Float64 Country share of total network
is_vpn_flagged Boolean Whether VPN/proxy detected

int_esg_node_client_distribution

Tracks the distribution of execution and consensus client software across the network.

Model Specification

Purpose: Determine which client software combinations nodes are running to assign client-specific power consumption estimates.

Inputs:

  • int_esg_node_population_chao1 (estimated node population)
  • Nebula client identification data

Key Transformations:

  • Parse client user-agent strings from Nebula crawl data
  • Identify execution client (Nethermind, Erigon, etc.) and consensus client (Lighthouse, Teku, Lodestar, Nimbus)
  • Compute daily client pair distributions
  • Apply distribution proportions to hidden node estimates

Output Columns:

Column Type Description
date Date Observation date
execution_client String Execution layer client name
consensus_client String Consensus layer client name
nodes_observed UInt32 Observed nodes with this client pair
nodes_estimated Float64 Estimated total with this client pair
share_pct Float64 Share of total network

int_esg_dynamic_power_consumption

Computes dynamic power consumption estimates per node category and client combination.

Model Specification

Purpose: Estimate per-node power consumption (watts) based on hardware category and client software, using CCRI benchmark data.

Inputs:

  • int_esg_node_classification (node categories)
  • int_esg_node_client_distribution (client software distribution)
  • CCRI benchmark reference data

Key Transformations:

  • Map client combinations to CCRI power consumption benchmarks
  • Apply category-specific PUE multipliers (Home=1.0, Professional=1.2, Cloud=1.1--1.4)
  • Compute weighted average power consumption across client distributions
  • Calculate power uncertainty bands from benchmark measurement variance
  • Produce total network energy consumption estimate (kWh/day)

Output Columns:

Column Type Description
date Date Observation date
category String Node category
client_pair String Execution + consensus client
power_w Float64 Estimated power consumption (watts)
pue Float64 Power Usage Effectiveness factor
power_w_with_pue Float64 Power including PUE overhead
power_std Float64 Power uncertainty (1-sigma, watts)
energy_kwh_day Float64 Daily energy consumption (kWh)

int_esg_carbon_intensity_ensemble

Computes country-level carbon intensity with full uncertainty quantification.

Model Specification

Purpose: Transform Ember generation mix data into carbon intensity estimates (gCO2/kWh) with temporal and measurement uncertainty bands.

Inputs:

  • stg_crawlers_data__ember_electricity_data (generation mix)
  • stg_crawlers_data__country_codes (country reference)
  • Emissions factor reference data

Key Transformations:

  • Apply generation-type emissions factors to compute weighted average CI per country
  • Classify grids by CI range and assign grid-type uncertainty
  • Apply continent-level seasonal adjustment factors
  • Combine temporal and measurement uncertainties in quadrature
  • Apply fallback hierarchy for countries without Ember data

See Carbon Intensity Model for full methodology details.

Output Columns:

Column Type Description
date Date Reference date
country_code String ISO 3166-1 alpha-2 code
ci_mean Float64 Carbon intensity mean (gCO2/kWh)
ci_std_total Float64 Combined uncertainty (gCO2/kWh)
ci_lower_95 Float64 95% CI lower bound
ci_upper_95 Float64 95% CI upper bound
seasonal_factor Float64 Applied seasonal adjustment
grid_classification String Grid type (very clean/clean/mixed/fossil-heavy)
data_source String Source: ember, world_average, or fallback

Mart Layer

fct_esg_carbon_footprint_uncertainty

The central fact table producing daily carbon footprint estimates with full uncertainty propagation.

Model Specification

Purpose: Combine node population, power consumption, geographic distribution, and carbon intensity into a single daily carbon footprint estimate with disaggregated uncertainty.

Inputs:

  • int_esg_dynamic_power_consumption (energy estimates)
  • int_esg_carbon_intensity_ensemble (CI estimates)
  • int_esg_node_geographic_distribution (country distribution)
  • int_esg_node_population_chao1 (population estimates)
  • int_esg_node_classification (category breakdowns)

Key Transformations:

  • Cross-join categories and countries to compute per-cell (category x country) emissions
  • Propagate uncertainties from all upstream sources in quadrature
  • Aggregate to daily network-level totals with combined uncertainty
  • Compute annualized projections (daily * 365.25)
  • Calculate confidence intervals at 68%, 90%, 95%, and 99% levels

Output: A 56+ column fact table with daily granularity. Key column groups include:

Column Group Example Columns Description
Date & identity date Partition key
Node counts nodes_observed, nodes_estimated, nodes_hidden, nodes_std Population estimates
Category breakdown nodes_home, nodes_professional, nodes_cloud Per-category counts
Energy energy_kwh_day, energy_kwh_day_std, energy_mwh_year Energy consumption
Carbon intensity ci_effective, ci_effective_std Network-weighted CI
Emissions co2_kg_day, co2_kg_day_std, co2_tonnes_year Carbon footprint
Uncertainty combined_rel_uncertainty, co2_ci_lower_95, co2_ci_upper_95 Confidence intervals
Per-category emissions co2_kg_day_home, co2_kg_day_professional, co2_kg_day_cloud Disaggregated emissions

API Models

Nine API-facing models transform the fact table into endpoint-specific views. All are materialized as tables optimized for low-latency queries.

API Model Specifications

api_esg_carbon_emissions_daily

Daily carbon emissions with headline metrics. Serves the primary emissions timeseries endpoint.

  • Key columns: date, co2_kg_day, co2_kg_day_lower, co2_kg_day_upper, energy_kwh_day, nodes_estimated
  • Sort key: date DESC

api_esg_carbon_emissions_annualised_latest

Latest annualized carbon emissions projection with uncertainty bounds. Single-row snapshot.

  • Key columns: co2_tonnes_year, co2_tonnes_year_lower_95, co2_tonnes_year_upper_95, energy_mwh_year, as_of_date

api_esg_carbon_timeseries_bands

Daily emissions with confidence bands for chart rendering. Includes 68%, 90%, and 95% intervals.

  • Key columns: date, co2_kg_day, band_68_lower, band_68_upper, band_90_lower, band_90_upper, band_95_lower, band_95_upper

api_esg_energy_monthly

Monthly aggregated energy consumption by category.

  • Key columns: month, energy_kwh_total, energy_kwh_home, energy_kwh_professional, energy_kwh_cloud

api_esg_energy_consumption_annualised_latest

Latest annualized energy consumption projection. Single-row snapshot.

  • Key columns: energy_mwh_year, energy_mwh_year_lower, energy_mwh_year_upper, avg_power_per_node_w, as_of_date

api_esg_cif_network_vs_countries_daily

Carbon intensity comparison: Gnosis network effective CI vs. individual countries.

  • Key columns: date, network_ci, country_code, country_ci, network_vs_country_ratio

api_esg_estimated_nodes_daily

Daily node count estimates with category breakdown.

  • Key columns: date, nodes_observed, nodes_estimated, nodes_home, nodes_professional, nodes_cloud, nodes_hidden_pct

api_esg_info_category_daily

Daily per-category breakdown of nodes, energy, and emissions.

  • Key columns: date, category, nodes, energy_kwh, co2_kg, power_w_avg, pue

api_esg_info_annual_daily

Rolling annual projections updated daily.

  • Key columns: date, co2_tonnes_year, energy_mwh_year, co2_per_validator_kg_year, comparison_btc_ratio, comparison_eth_ratio