Node Population Estimation

The Hidden-Node Problem

Network crawlers such as Nebula traverse the Gnosis Chain's libp2p DHT and record every peer they can reach. However, a significant fraction of validators sit behind NAT gateways, firewalls, or restrictive cloud security groups and are never directly contacted. Reporting only the observed node count would systematically underestimate the true network size and, by extension, the energy footprint.

graph LR
    A[Nebula Crawler] -->|Discovers| B[Visible Nodes]
    A -.->|Cannot Reach| C[Hidden Nodes]
    B --> D[Observed Count S_obs]
    C --> E[Estimated via Chao-1]
    D --> F[Total Population Estimate]
    E --> F

Why raw counts are insufficient

A naive count of observed peers typically captures only 55 to 65 % of the true validator population. Nodes behind NAT or VPN tunnels, and nodes running on short-lived cloud instances, are systematically missed, so both network size and energy consumption are undercounted.


Chao-1 Nonparametric Estimator

The module applies the Chao-1 estimator (Chao, 1984), a species-richness method from ecology that infers the number of unseen classes from the frequency distribution of rare observations. In our context, a "class" is a unique node and the observation frequency is the number of independent crawl sessions in which that node was successfully contacted.

Definitions

| Symbol | Meaning |
|---|---|
| \(S_{\text{obs}}\) | Number of distinct nodes observed across all crawl sessions |
| \(f_1\) | Number of singletons: nodes seen in exactly 1 session |
| \(f_2\) | Number of doubletons: nodes seen in exactly 2 sessions |
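As an illustration of these definitions, the following minimal Python sketch derives the three quantities from a list of crawl sessions. The function name `frequency_counts` and the peer IDs are hypothetical and are not part of the dbt project.

```python
from collections import Counter


def frequency_counts(sessions):
    """Return (s_obs, f1, f2) from an iterable of per-session peer-ID collections."""
    # Count the number of sessions in which each peer appears; set() ensures
    # a peer contacted twice within one session still counts once.
    seen_in = Counter()
    for peers in sessions:
        seen_in.update(set(peers))
    # Tally how many peers share each session count (frequency of frequencies).
    freq_of_freq = Counter(seen_in.values())
    return len(seen_in), freq_of_freq[1], freq_of_freq[2]


# Three hypothetical crawl sessions with made-up peer IDs.
sessions = [
    {"peerA", "peerB", "peerC"},
    {"peerA", "peerB", "peerD"},
    {"peerA", "peerE"},
]
s_obs, f1, f2 = frequency_counts(sessions)
print(s_obs, f1, f2)  # 5 3 1
```

Here `peerA` is seen three times, `peerB` twice, and the remaining three peers once each, so only the rare peers (singletons and doubletons) feed into the estimator.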

Standard Chao-1 Formula

The standard Chao-1 estimate of total node population is:

\[ \hat{S}_{\text{Chao1}} = S_{\text{obs}} + \frac{f_1^2}{2 \, f_2} \]

Bias-Corrected Form

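When \(f_2 = 0\) (no doubletons), the standard formula is undefined. The bias-corrected form \(S_{\text{obs}} + f_1 (f_1 - 1) / (2 (f_2 + 1))\) is used instead, which at \(f_2 = 0\) reduces to: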

\[ \hat{S}_{\text{Chao1}} = S_{\text{obs}} + \frac{f_1 \,(f_1 - 1)}{2} \]

When to use the bias-corrected form

In practice, \(f_2 = 0\) is rare for multi-session crawls. The bias-corrected variant guards against division-by-zero and is automatically selected by the dbt model when the doubleton count drops to zero.
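The selection between the two forms can be sketched in a few lines of Python. This is an illustrative helper, not the dbt model itself, and the name `chao1_estimate` is an assumption.

```python
def chao1_estimate(s_obs: int, f1: int, f2: int) -> float:
    """Chao-1 point estimate with a fallback for the no-doubleton case."""
    if f2 > 0:
        # Standard form: S_obs + f1^2 / (2 * f2)
        return s_obs + f1 ** 2 / (2 * f2)
    # Bias-corrected form evaluated at f2 = 0: S_obs + f1 * (f1 - 1) / 2
    return s_obs + f1 * (f1 - 1) / 2


print(chao1_estimate(5, 3, 1))  # 9.5
print(chao1_estimate(5, 3, 0))  # 8.0
```

When there are no singletons either (\(f_1 = 0\)), both branches return \(S_{\text{obs}}\) unchanged, which matches the intuition that a sample without rare observations gives no evidence of unseen nodes.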

Variance and Confidence Intervals

The analytical variance of the Chao-1 estimator, valid when \(f_1 > 0\) and \(f_2 > 0\), is:

\[ \widehat{\text{Var}}(\hat{S}_{\text{Chao1}}) = f_2 \left[\frac{1}{4}\left(\frac{f_1}{f_2}\right)^4 + \left(\frac{f_1}{f_2}\right)^3 + \frac{1}{2}\left(\frac{f_1}{f_2}\right)^2\right] \]

A 95 % confidence interval is constructed using a log transformation so that the lower bound never falls below \(S_{\text{obs}}\):

\[ \left[\; S_{\text{obs}} + \frac{\hat{S}_{\text{Chao1}} - S_{\text{obs}}}{C}, \;\; S_{\text{obs}} + C \cdot (\hat{S}_{\text{Chao1}} - S_{\text{obs}}) \;\right] \]

where \(C = \exp\!\bigl(1.96 \sqrt{\ln(1 + \widehat{\text{Var}} / (\hat{S}_{\text{Chao1}} - S_{\text{obs}})^2)}\bigr)\).
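The variance and interval formulas above can be combined into one small Python function. This is a sketch under the assumption that both \(f_1\) and \(f_2\) are positive; the function name `chao1_log_ci` and the input values in the usage example are hypothetical.

```python
import math


def chao1_log_ci(s_obs, f1, f2, z=1.96):
    """Return (estimate, variance, (lower, upper)) for the standard Chao-1 form."""
    if f1 <= 0 or f2 <= 0:
        # The variance formula divides by f2, and the interval divides by
        # the unseen estimate, so both counts must be positive.
        raise ValueError("requires f1 > 0 and f2 > 0")
    ratio = f1 / f2
    s_hat = s_obs + f1 ** 2 / (2 * f2)
    variance = f2 * (0.25 * ratio ** 4 + ratio ** 3 + 0.5 * ratio ** 2)
    unseen = s_hat - s_obs  # estimated number of undetected nodes
    c = math.exp(z * math.sqrt(math.log(1 + variance / unseen ** 2)))
    return s_hat, variance, (s_obs + unseen / c, s_obs + unseen * c)


# Hypothetical counts, chosen only to exercise the formulas.
estimate, variance, (lower, upper) = chao1_log_ci(s_obs=100, f1=20, f2=10)
print(estimate, variance, round(lower, 1), round(upper, 1))
```

Because \(C \ge 1\), the lower bound always lies between \(S_{\text{obs}}\) and the point estimate, while the upper bound stretches further away from it, which produces the asymmetric interval.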

Reference: Chao & Chiu (2016)

The log-transform confidence interval follows the methodology described in Chao & Chiu (2016), Species Richness: Estimation and Comparison. The log-normal approach yields asymmetric intervals that respect the natural lower bound of \(S_{\text{obs}}\) and perform well even when the sample coverage is low.


Failure-Recovery Augmentation

Not every failed connection represents a permanently invisible node. Some failures are transient and can be resolved by retry logic. The recovery layer re-attempts failed peers and assigns a recovery rate based on the failure type:

| Failure Mode | Recovery Rate | Description |
|---|---|---|
| Timeout | 30 % | Peer was reachable but did not respond within the deadline |
| Connection Refused | 10 % | TCP handshake rejected; port may be intermittently closed |
| Unreachable | 5 % | No route to host; typically hard NAT or offline nodes |
| Protocol Mismatch | 80 % | Peer responded but with an incompatible protocol version |
| Other | 20 % | Catch-all for unclassified connection errors |

Recovered Count Formula

The recovered node count, summed over all failure categories, is:

\[ N_{\text{recovered}} = \sum_{i \in \text{failure types}} \text{failed}_i \times \text{rate}_i \]

The total adjusted population becomes:

\[ \hat{S}_{\text{total}} = \hat{S}_{\text{Chao1}} + N_{\text{recovered}} \]
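A minimal Python sketch of this adjustment is shown below. The rates are copied from the failure-mode table; the dictionary keys, the function names, and the failure counts in the usage example are hypothetical.

```python
# Recovery rates from the failure-mode table; the key names are illustrative.
RECOVERY_RATES = {
    "timeout": 0.30,
    "connection_refused": 0.10,
    "unreachable": 0.05,
    "protocol_mismatch": 0.80,
    "other": 0.20,
}


def recovered_count(failed_by_type):
    """Sum failed_i * rate_i over all failure types."""
    return sum(count * RECOVERY_RATES[kind] for kind, count in failed_by_type.items())


def total_population(chao1, failed_by_type):
    """Chao-1 estimate plus the expected number of recoverable nodes."""
    return chao1 + recovered_count(failed_by_type)


# Made-up failure counts for a single estimation window.
failed = {
    "timeout": 100,
    "connection_refused": 50,
    "unreachable": 200,
    "protocol_mismatch": 10,
    "other": 20,
}
print(round(recovered_count(failed), 1))
print(round(total_population(1500.0, failed), 1))
```

An unknown failure type raises a `KeyError` in this sketch, which makes a missing rate visible instead of silently treating it as zero.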

Recovery rates are empirically calibrated

The rates above are derived from periodic full-retry experiments run against the Gnosis Chain mainnet. They are reviewed quarterly and may change as network conditions evolve.


dbt Implementation

The Chao-1 estimation is split across two dbt models: a staging model that consolidates raw crawl data and an intermediate model that computes the estimator.

stg_chao1_observers

This staging model consolidates crawl data from multiple independent observer instances into a single unified stream. Each observer contributes its own perspective of the network, which is essential for the frequency-based estimator.

-- stg_chao1_observers.sql
-- Consolidates crawl data from multiple observer instances

WITH observer_alpha AS (
    SELECT
        crawl_timestamp,
        peer_id,
        ip_address,
        'alpha' AS observer_id
    FROM {{ source('nebula', 'crawl_results_alpha') }}
),

observer_beta AS (
    SELECT
        crawl_timestamp,
        peer_id,
        ip_address,
        'beta' AS observer_id
    FROM {{ source('nebula', 'crawl_results_beta') }}
)

SELECT * FROM observer_alpha
UNION ALL
SELECT * FROM observer_beta

Output columns:

| Column | Type | Description |
|---|---|---|
| crawl_timestamp | DateTime | Timestamp of the crawl session |
| peer_id | String | Unique libp2p peer identifier |
| ip_address | String | IP address observed during the crawl |
| observer_id | String | Identifier for the observer instance (alpha, beta, etc.) |

int_esg_node_population_chao1

This intermediate model groups crawl observations into hourly windows and applies the Chao-1 estimator within each window.

-- int_esg_node_population_chao1.sql
-- Applies Chao-1 estimator in hourly windows

WITH peer_frequencies AS (
    -- Per hour, count the distinct observers that saw each peer;
    -- each observer instance is treated as one crawl session
    SELECT
        toStartOfHour(crawl_timestamp) AS hour_timestamp,
        peer_id,
        count(DISTINCT observer_id) AS session_count
    FROM {{ ref('stg_chao1_observers') }}
    GROUP BY hour_timestamp, peer_id
),

singletons AS (
    -- F1: peers seen in exactly 1 session
    SELECT
        hour_timestamp,
        countIf(session_count = 1) AS f1
    FROM peer_frequencies
    GROUP BY hour_timestamp
),

doubletons AS (
    -- F2: peers seen in exactly 2 sessions
    SELECT
        hour_timestamp,
        countIf(session_count = 2) AS f2
    FROM peer_frequencies
    GROUP BY hour_timestamp
),

observed AS (
    -- S_obs: total distinct peers per hour
    SELECT
        hour_timestamp,
        count(DISTINCT peer_id) AS s_obs
    FROM peer_frequencies
    GROUP BY hour_timestamp
)

SELECT
    o.hour_timestamp,
    o.s_obs,
    s.f1,
    d.f2,
    -- Chao-1 estimate (bias-corrected when f2 = 0)
    CASE
        WHEN d.f2 > 0
            THEN o.s_obs + (pow(s.f1, 2) / (2.0 * d.f2))
        ELSE
            o.s_obs + (s.f1 * (s.f1 - 1)) / 2.0
    END AS chao1_estimate,
    -- Trailing 7-day moving average for smoothing; the row-based frame
    -- spans exactly 7 days only when every hour has a row
    avg(
        CASE
            WHEN d.f2 > 0
                THEN o.s_obs + (pow(s.f1, 2) / (2.0 * d.f2))
            ELSE
                o.s_obs + (s.f1 * (s.f1 - 1)) / 2.0
        END
    ) OVER (
        ORDER BY o.hour_timestamp
        ROWS BETWEEN 167 PRECEDING AND CURRENT ROW  -- 7 days * 24 hours
    ) AS chao1_estimate_7day_avg
FROM observed o
LEFT JOIN singletons s ON o.hour_timestamp = s.hour_timestamp
LEFT JOIN doubletons d ON o.hour_timestamp = d.hour_timestamp
ORDER BY o.hour_timestamp

Output columns:

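| Column | Type | Description |
|---|---|---|
| hour_timestamp | DateTime | Start of the hourly observation window |
| s_obs | UInt64 | Number of distinct peers observed in the window |
| f1 | UInt64 | Number of peers seen in exactly 1 session |
| f2 | UInt64 | Number of peers seen in exactly 2 sessions |
| chao1_estimate | Float64 | Point estimate of total node population |
| chao1_estimate_7day_avg | Float64 | 7-day rolling average of the Chao-1 estimate |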
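The model's window frame counts rows (the current row plus the 167 preceding ones), so it corresponds to 7 days only when no hourly window is missing. For comparison, the Python sketch below computes a time-based trailing average that stays correct when hours are missing. It is an alternative illustration, not what the dbt model does, and the function name `trailing_average` is hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def trailing_average(points, window=timedelta(days=7)):
    """Average each value with all earlier values inside the trailing window.

    points: iterable of (timestamp, estimate) pairs sorted by timestamp.
    """
    buf = deque()
    total = 0.0
    out = []
    for ts, value in points:
        buf.append((ts, value))
        total += value
        # Drop entries outside the half-open window (ts - window, ts].
        while buf[0][0] <= ts - window:
            _, old = buf.popleft()
            total -= old
        out.append((ts, total / len(buf)))
    return out


# Two made-up estimates eight days apart: the second average ignores the first.
start = datetime(2024, 1, 1)
print(trailing_average([(start, 10.0), (start + timedelta(days=8), 20.0)]))
```

With gap-free hourly data the window holds 168 points, the same as the row-based frame. With gaps, the row-based frame reaches further back in time than 7 days, whereas this version does not.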

Typical Results

The table below shows a representative breakdown for a recent estimation window:

| Component | Count | Source |
|---|---|---|
| Directly observed nodes (\(S_{\text{obs}}\)) | ~1,200 | Nebula crawl sessions |
| Hidden nodes (Chao-1 unseen estimate) | ~300 | \(f_1^2 / (2 f_2)\) |
| Recovered from failed connections | ~300 | Failure-recovery augmentation |
| Total estimated population | ~2,200 | Combined estimate |

Why this matters

Without the Chao-1 estimator and failure-recovery augmentation, the reported node count would be roughly 1,200 instead of 2,200, an undercount of approximately 45 %. This directly affects the accuracy of the network's energy consumption estimate and, consequently, the credibility of the ESG report.

pie title Node Population Breakdown
    "Directly Observed" : 1200
    "Hidden (Chao-1)" : 300
    "Recovered (Retries)" : 300