click-runner¶
click-runner is a modular Python toolkit for loading data into ClickHouse from various external sources. It supports SQL query execution, CSV ingestion via ClickHouse URL engine, and Parquet ingestion from S3 buckets.
Purpose¶
Not all data in the Gnosis Analytics pipeline comes from blockchain nodes. click-runner handles the ingestion of external datasets:
- Ember -- global electricity generation data used for ESG carbon footprint calculations
- ProbeLab -- daily peer-to-peer network metrics (agent versions, peer distributions, crawl statistics)
- Administrative queries -- schema migrations, data maintenance, and custom SQL operations
Ingestion Modes¶
click-runner operates in three modes, selected via the --ingestor parameter:
Query Mode¶
Executes arbitrary SQL files against ClickHouse. Used for administrative tasks, schema updates, and custom transformations.
CSV Mode¶
Imports data from CSV files using ClickHouse's built-in URL engine. The typical workflow involves three SQL files: table creation, data insertion (reading from a remote CSV URL), and optional optimization.
python run_queries.py --ingestor=csv \
--create-table-sql=queries/ember/create_ember_table.sql \
--insert-sql=queries/ember/insert_ember_data.sql \
--optimize-sql=queries/ember/optimize_ember_data.sql
Parquet Mode¶
Imports data from Parquet files stored in S3 buckets. Supports three sub-modes:
| Sub-mode | Description | Use Case |
|---|---|---|
latest | Import only the most recent file | Daily incremental updates |
date | Import a file for a specific date | Targeted backfills |
all | Import all available files | Full historical load |
# Import latest file
python run_queries.py --ingestor=parquet \
--create-table-sql=queries/probelab/probelab_agent_semvers_avg_1d.up.sql \
--s3-path=assets/agent_semvers_avg_1d_data/{{DATE}}.parquet \
--table-name=crawlers_data.probelab_agent_semvers_avg_1d \
--mode=latest
# Import file for a specific date
python run_queries.py --ingestor=parquet \
--create-table-sql=queries/probelab/probelab_agent_semvers_avg_1d.up.sql \
--s3-path=assets/agent_semvers_avg_1d_data/{{DATE}}.parquet \
--table-name=crawlers_data.probelab_agent_semvers_avg_1d \
--mode=date --date=2025-04-13
Configuration¶
ClickHouse Connection¶
| Variable | Default | Description |
|---|---|---|
CH_HOST | -- | ClickHouse server hostname |
CH_PORT | 9000 | ClickHouse native protocol port |
CH_USER | -- | Authentication username |
CH_PASSWORD | -- | Authentication password |
CH_DB | -- | Target database name |
CH_SECURE | False | Use TLS connection |
CH_VERIFY | False | Verify TLS certificate |
S3 Integration¶
| Variable | Default | Description |
|---|---|---|
S3_ACCESS_KEY | -- | AWS access key ID |
S3_SECRET_KEY | -- | AWS secret access key |
S3_BUCKET | prod-use1-gnosis | S3 bucket name |
S3_REGION | us-east-1 | AWS region |
Data Source URLs¶
| Variable | Description |
|---|---|
EMBER_DATA_URL | URL to the Ember electricity CSV data |
Variable Substitution in SQL¶
SQL files support {{VARIABLE_NAME}} placeholders that are replaced with values from environment variables prefixed with CH_QUERY_VAR_. For example:
- Environment variable:
CH_QUERY_VAR_EMBER_DATA_URL=https://example.com/data.csv - In SQL:
FROM url('{{EMBER_DATA_URL}}', 'CSV')
Docker Compose Services¶
The docker-compose.yml provides preconfigured services:
| Service | Mode | Description |
|---|---|---|
click-runner | Any | Generic service, accepts any --ingestor argument |
ember-ingestor | CSV | Preconfigured for Ember electricity data |
probelab-agent-semvers-ingestor | Parquet | Preconfigured for ProbeLab agent version data |
# Run the Ember ingestor
docker-compose run --rm ember-ingestor
# Run the ProbeLab ingestor
docker-compose run --rm probelab-agent-semvers-ingestor
# Run arbitrary queries
docker-compose run click-runner --ingestor=query \
--queries=queries/file1.sql
Scheduling¶
click-runner services are typically scheduled as daily cron jobs:
# Daily Ember data update at 2 AM
0 2 * * * cd /path/to/click-runner && docker-compose run --rm ember-ingestor
# Daily ProbeLab data update at 3 AM
0 3 * * * cd /path/to/click-runner && docker-compose run --rm probelab-agent-semvers-ingestor
A convenience script cron_setup.sh is included to automatically create these cron entries.
Adding New Data Sources¶
New CSV Source¶
- Create SQL files in
queries/new_source/:create_table.sql-- table schema definitioninsert_data.sql-- INSERT using ClickHouse URL engineoptimize.sql(optional) -- post-load optimization
- Set the data URL as an environment variable
- Run with
--ingestor=csv
New Parquet Source¶
- Create a table definition SQL file in
queries/new_source/ - Run with
--ingestor=parquet, specifying the S3 path pattern with{{DATE}}placeholder - Add a Docker Compose service for convenient scheduling
New File Format¶
Extend the framework by creating a new ingestor class that inherits from BaseIngestor in ingestors/:
- Create
ingestors/new_format_ingestor.pyextendingBaseIngestor - Implement the
ingest()method - Register the new ingestor type in
run_queries.py
Project Structure¶
click-runner/
├── run_queries.py # Main CLI entry point
├── ingestors/
│ ├── base.py # Abstract base ingestor
│ ├── csv_ingestor.py # CSV ingestion logic
│ └── parquet_ingestor.py # Parquet/S3 ingestion logic
├── utils/
│ ├── s3.py # S3 file discovery utilities
│ ├── db.py # ClickHouse connection helpers
│ └── date.py # Date parsing utilities
├── queries/
│ ├── ember/ # Ember electricity data SQL
│ └── probelab/ # ProbeLab metrics SQL
├── Dockerfile
└── docker-compose.yml
ClickHouse Table Schemas¶
All tables are stored in the crawlers_data database.
Table: crawlers_data.dune_labels
Engine: MergeTree ORDER BY: (address, label, introduced_at)
| Column | Type | Notes |
|---|---|---|
address | String | Contract or EOA address |
label | String | Human-readable label |
introduced_at | DateTime | When the label was first seen |
source | LowCardinality(String) | Label data source |
Table: crawlers_data.dune_prices
Engine: MergeTree ORDER BY: (symbol, block_date)
| Column | Type | Notes |
|---|---|---|
block_date | Date | Price date |
symbol | LowCardinality(String) | Token symbol |
price | Float64 | Token price in USD |
Table: crawlers_data.dune_bridge_flows
Engine: MergeTree ORDER BY: (bridge, source_chain, dest_chain, token, timestamp)
| Column | Type | Notes |
|---|---|---|
timestamp | DateTime | Bridge event timestamp |
bridge | LowCardinality(String) | Bridge protocol name |
source_chain | LowCardinality(String) | Origin chain |
dest_chain | LowCardinality(String) | Destination chain |
token | LowCardinality(String) | Bridged token |
amount_token | Float64 | Amount in token units |
amount_usd | Float64 | Amount in USD |
net_usd | Float64 | Net USD flow |
Table: crawlers_data.dune_gno_supply
Engine: MergeTree ORDER BY: (label, block_date)
| Column | Type | Notes |
|---|---|---|
label | LowCardinality(String) | Supply category label |
block_date | Date | Snapshot date |
supply | Float64 | GNO supply amount |
Table: crawlers_data.ember_electricity_data
Engine: ReplacingMergeTree(version) ORDER BY: (Date, Area, Category, Subcategory, Variable)
| Column | Type | Notes |
|---|---|---|
Area | String | Country or region |
ISO 3 code | String | ISO 3166-1 alpha-3 country code |
Date | Date | Data date |
Continent | String | Continent name |
Category | String | Electricity data category |
Subcategory | String | Electricity data subcategory |
Variable | String | Measured variable |
Unit | String | Measurement unit |
Value | Float64 | Measured value |
YoY absolute change | Float64 | Year-over-year absolute change |
YoY % change | Float64 | Year-over-year percentage change |
version | DateTime | Row version for deduplication |
Table: crawlers_data.gpay_wallets
Engine: MergeTree ORDER BY: (SAFE_address, SAFE_createdAt)
| Column | Type | Notes |
|---|---|---|
SAFE_address | String | Safe wallet address |
SAFE_createdAt | DateTime | Wallet creation timestamp |