Monitoring

The Gnosis Analytics platform uses Prometheus for metrics collection, Promtail and Loki for log aggregation, and Prometheus alerting rules routed through Alertmanager for proactive issue detection. This page documents the observability stack, key metrics, alerting configuration, health checks, and dashboards.

Observability Stack

flowchart LR
    subgraph Services
        API[cerebro-api]
        IDX[Indexers]
        CRW[Crawlers]
    end
    subgraph Metrics
        PROM[Prometheus]
    end
    subgraph Logging
        PROMTAIL[Promtail]
        LOKI[Loki]
    end
    subgraph Alerting
        AM[Alertmanager]
    end

    API -->|metrics| PROM
    IDX -->|metrics| PROM
    CRW -->|metrics| PROM
    API -->|stdout/stderr| PROMTAIL
    IDX -->|stdout/stderr| PROMTAIL
    CRW -->|stdout/stderr| PROMTAIL
    PROMTAIL --> LOKI
    PROM --> AM
    AM -->|notifications| SLACK[Slack / Email]

Metrics Collection

Prometheus

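Prometheus runs in the EKS cluster and scrapes metrics from all services via Kubernetes service discovery. Pods opt in through `prometheus.io/*` annotations. Prometheus does not interpret these annotations natively; the relabeling rules in the scrape configuration read them to decide whether to scrape a pod, and on which port and path.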

Pod annotations for scraping:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
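To make the opt-in behaviour concrete, here is a minimal Python sketch of how those annotations map to a scrape URL. It only imitates what relabeling rules typically do; the function name, the pod IP, and the fallback values are assumptions for illustration, not part of the platform's configuration.

```python
from typing import Mapping, Optional


def scrape_target(pod_ip: str, annotations: Mapping[str, str]) -> Optional[str]:
    """Return the URL that would be scraped for a pod, or None if it has not opted in."""
    # Only pods that explicitly set prometheus.io/scrape to "true" are scraped.
    if annotations.get("prometheus.io/scrape") != "true":
        return None
    # Fallbacks below are assumptions for this sketch; real defaults depend on
    # the relabeling rules in the Prometheus scrape configuration.
    port = annotations.get("prometheus.io/port", "8000")
    path = annotations.get("prometheus.io/path", "/metrics")
    return f"http://{pod_ip}:{port}{path}"


annotations = {
    "prometheus.io/scrape": "true",
    "prometheus.io/port": "8000",
    "prometheus.io/path": "/metrics",
}
print(scrape_target("10.0.0.12", annotations))  # http://10.0.0.12:8000/metrics
print(scrape_target("10.0.0.13", {}))           # None
```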

Key Metrics

API Metrics

Metric	Type	Description
`http_requests_total`	Counter	Total HTTP requests by method, path, and status code
`http_request_duration_seconds`	Histogram	Request latency distribution
`http_requests_in_progress`	Gauge	Number of requests currently being processed
`api_manifest_refresh_total`	Counter	Number of manifest refresh attempts
`api_manifest_refresh_errors_total`	Counter	Number of failed manifest refreshes
`api_endpoints_registered`	Gauge	Number of currently registered API endpoints
`api_rate_limit_hits_total`	Counter	Number of requests rejected by rate limiting
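As an illustration of what a labelled counter such as `http_requests_total` looks like on the `/metrics` endpoint, the following standard-library sketch keeps one count per label combination and renders it in the Prometheus text exposition format. The label names `method` and `path` are assumed from the description above (`status` also appears in the alert rule later on this page), and a real service would more likely use a Prometheus client library than hand-rolled code like this.

```python
from collections import Counter

# (method, path, status) -> number of requests seen so far
http_requests_total: Counter = Counter()


def observe_request(method: str, path: str, status: int) -> None:
    """Increment the counter for one finished request."""
    http_requests_total[(method, path, str(status))] += 1


def render_metrics() -> str:
    """Render the counter in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests by method, path, and status code",
        "# TYPE http_requests_total counter",
    ]
    for (method, path, status), value in sorted(http_requests_total.items()):
        lines.append(
            f'http_requests_total{{method="{method}",path="{path}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


observe_request("GET", "/", 200)
observe_request("GET", "/", 200)
observe_request("GET", "/", 500)
print(render_metrics())
```

Each distinct label combination becomes its own time series, which is why high-cardinality values (for example raw URLs with IDs) are usually kept out of labels.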

ClickHouse Query Metrics

Metric	Type	Description
`clickhouse_query_duration_seconds`	Histogram	ClickHouse query execution time
`clickhouse_query_errors_total`	Counter	Number of failed ClickHouse queries
`clickhouse_connection_pool_size`	Gauge	Current connection pool size
`clickhouse_rows_returned`	Histogram	Number of rows returned per query

Indexer Metrics

Metric	Type	Description
`indexer_blocks_processed_total`	Counter	Total blocks indexed
`indexer_current_block`	Gauge	Most recently indexed block number
`indexer_chain_head_block`	Gauge	Current chain head block number
`indexer_lag_blocks`	Gauge	Difference between chain head and indexed block
`indexer_processing_duration_seconds`	Histogram	Time to process each block batch
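The relationship between the three block gauges can be written down directly. This is a small sketch, assuming `indexer_lag_blocks` is simply the chain head minus the indexed block; the clamp to zero and the constant name are assumptions, while the 100-block threshold comes from the Indexer Lag warning alert on this page.

```python
LAG_WARNING_THRESHOLD = 100  # blocks, taken from the "Indexer Lag" warning alert


def indexer_lag(chain_head_block: int, current_block: int) -> int:
    """Blocks between the chain head and the most recently indexed block."""
    # Clamp at zero (an assumption) so a briefly stale chain-head reading
    # never produces a negative lag.
    return max(chain_head_block - current_block, 0)


lag = indexer_lag(chain_head_block=33_500_250, current_block=33_500_100)
print(lag)                          # 150
print(lag > LAG_WARNING_THRESHOLD)  # True, the warning condition is met
```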

Crawler Metrics

Metric	Type	Description
`crawler_peers_discovered`	Gauge	Number of peers found in last crawl
`crawler_crawl_duration_seconds`	Histogram	Duration of each crawl cycle
`crawler_ips_enriched_total`	Counter	Total IP addresses enriched with geolocation
`crawler_api_errors_total`	Counter	External API call failures

Log Aggregation

Promtail + Loki

Promtail runs as a DaemonSet on every node, tailing container logs and shipping them to Loki for centralized storage and querying.

Log pipeline:

Containers write to stdout/stderr (standard Kubernetes logging)
Promtail discovers containers via Kubernetes API and tails their log files
Logs are enriched with labels: namespace, pod name, container name
Logs are pushed to Loki for storage
Logs are queryable via LogQL
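The push step in the pipeline above can be sketched as follows. Promtail handles this itself, so the code is purely illustrative: it builds the JSON body accepted by Loki's push endpoint (`POST /loki/api/v1/push`), with one stream identified by its labels. The function name and the label values are hypothetical.

```python
import json
import time
from typing import Dict, Optional


def loki_push_payload(labels: Dict[str, str], line: str, ts_ns: Optional[int] = None) -> str:
    """Build a Loki push request body containing a single log line."""
    if ts_ns is None:
        ts_ns = time.time_ns()
    payload = {
        "streams": [
            {
                # Labels identify the stream; they are what LogQL selectors match on.
                "stream": labels,
                # Each value is [timestamp in nanoseconds as a string, log line].
                "values": [[str(ts_ns), line]],
            }
        ]
    }
    return json.dumps(payload)


body = loki_push_payload(
    {"namespace": "cerebro", "pod": "cerebro-api-0", "container": "cerebro-api"},
    '{"level": "INFO", "message": "Manifest refreshed successfully"}',
    ts_ns=1710498645000000000,
)
print(body)
```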

Structured Logging

Services use structured JSON logging for machine-parseable log entries:

{
  "timestamp": "2024-03-15T10:30:45Z",
  "level": "INFO",
  "message": "Manifest refreshed successfully",
  "service": "cerebro-api",
  "models_loaded": 412,
  "duration_ms": 1250
}
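One way to produce entries of this shape is a custom formatter for Python's standard `logging` module. This is a sketch, not the services' actual logging setup: the `JsonFormatter` class and the `extra={"fields": ...}` convention for additional keys are assumptions made for the example.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
        }
        # Hypothetical convention: extra structured fields are passed as
        # logger.info(..., extra={"fields": {...}}).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler(sys.stdout)  # stdout, so Promtail can tail it
handler.setFormatter(JsonFormatter("cerebro-api"))
logger = logging.getLogger("cerebro-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Manifest refreshed successfully",
    extra={"fields": {"models_loaded": 412, "duration_ms": 1250}},
)
```

Because every entry is valid JSON on one line, the LogQL `json` stage can extract fields such as `duration_ms` without custom parsing rules.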

Log Queries (LogQL)

Common LogQL queries for troubleshooting:

API errors:

{namespace="cerebro", container="cerebro-api"} |= "ERROR"

ClickHouse connection issues:

{namespace="cerebro"} |= "ClickHouse" |= "error"

Manifest refresh events:

{container="cerebro-api"} |= "Manifest"

Indexer progress:

{namespace="indexers"} |= "block" | json | line_format "{{.block_number}}"
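These queries can also be run outside a UI through Loki's HTTP API. The sketch below only builds the request URL for the `query_range` endpoint; the base URL `http://loki:3100` is a placeholder, not the platform's real address, and the helper name is hypothetical.

```python
from urllib.parse import urlencode


def loki_query_url(base_url: str, query: str, start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a Loki query_range URL for a LogQL query over a time window."""
    params = urlencode(
        {
            "query": query,     # the LogQL expression, URL-encoded by urlencode
            "start": start_ns,  # window start, Unix epoch in nanoseconds
            "end": end_ns,      # window end, Unix epoch in nanoseconds
            "limit": limit,     # maximum number of log lines to return
        }
    )
    return f"{base_url}/loki/api/v1/query_range?{params}"


url = loki_query_url(
    "http://loki:3100",  # placeholder address
    '{namespace="cerebro", container="cerebro-api"} |= "ERROR"',
    start_ns=1710498000000000000,
    end_ns=1710498600000000000,
)
print(url)
```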

Alerting

Alertmanager

Prometheus Alertmanager handles alert routing and notification delivery. Alerts are routed to Slack channels and email based on severity.

Alert Rules

Critical Alerts

These alerts indicate service outages requiring immediate attention:

Alert	Condition	Severity
API Down	No successful health check responses for 5 minutes	Critical
ClickHouse Unreachable	All ClickHouse queries failing for 3 minutes	Critical
High Error Rate	HTTP 5xx error rate exceeds 5% for 5 minutes	Critical
Manifest Refresh Failing	No successful manifest refresh in 30 minutes	Critical

# Example Prometheus alert rule
groups:
  - name: cerebro-api
    rules:
      - alert: APIHighErrorRate
        expr: |
          sum(rate(http_requests_total{namespace="cerebro",status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{namespace="cerebro"}[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "API error rate above 5%"
          description: >
            The cerebro-api 5xx error rate has exceeded 5% for the last 5 minutes.
            Current rate: {{ $value | humanizePercentage }}.
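To see what the rule expression evaluates, it helps to work through the arithmetic on raw counter samples. Both rates use the same 5-minute window, so the window length cancels and the ratio reduces to "increase in 5xx requests divided by increase in all requests". The sketch below assumes two samples per counter and ignores counter resets and the extrapolation that PromQL's `rate()` performs; the sample values are invented.

```python
ERROR_RATE_THRESHOLD = 0.05  # 5%, from the alert rule


def error_ratio(total_start: float, total_end: float,
                errors_start: float, errors_end: float) -> float:
    """Share of requests that returned 5xx between two counter samples."""
    total_increase = total_end - total_start
    error_increase = errors_end - errors_start
    if total_increase <= 0:
        # No traffic in the window. PromQL would yield NaN here and the alert
        # would not fire; this sketch returns 0.0 for simplicity.
        return 0.0
    return error_increase / total_increase


# Invented samples taken 5 minutes apart: 1000 new requests, 60 of them 5xx.
ratio = error_ratio(total_start=12_000, total_end=13_000,
                    errors_start=40, errors_end=100)
print(f"{ratio:.1%}")                # 6.0%
print(ratio > ERROR_RATE_THRESHOLD)  # True
```

The alert only fires if the condition stays true for the full `for: 5m` period, so a single bad evaluation is not enough.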

Warning Alerts

These alerts indicate degraded performance or potential issues:

Alert	Condition	Severity
High Latency	P95 API latency exceeds 5 seconds for 10 minutes	Warning
Indexer Lag	Block indexer more than 100 blocks behind chain head	Warning
Rate Limit Spike	Rate limit rejections exceed 50/min	Warning
Pod Restarts	Pod restarted more than 3 times in 15 minutes	Warning
Memory Pressure	Pod memory usage exceeds 80% of limit	Warning
Disk Usage	PVC usage exceeds 80%	Warning
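The High Latency condition depends on a P95 estimated from the `http_request_duration_seconds` histogram, presumably via PromQL's `histogram_quantile`. The following sketch reproduces the core of that estimate, linear interpolation inside the bucket that contains the target rank, under the assumption of cumulative bucket counts with a final `+Inf` bucket. The bucket boundaries and counts are invented for the example.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) histogram buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total observation count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite bound, since nothing more precise is known.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented cumulative counts per upper bound (seconds).
buckets = [(0.5, 800), (1.0, 900), (2.5, 940), (5.0, 980), (10.0, 995), (math.inf, 1000)]
p95 = histogram_quantile(0.95, buckets)
print(round(p95, 3))  # 3.125, below the 5-second warning threshold
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the 5-second threshold.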

Informational Alerts

Alert	Condition	Severity
Manifest Updated	Manifest refreshed with route changes	Info
CronJob Failed	A click-runner CronJob did not complete successfully	Info
New Endpoint Registered	API endpoint count changed	Info

Notification Channels

Channel	Alert Severities	Purpose
`#cerebro-alerts` (Slack)	Critical, Warning	Immediate team notification
`#cerebro-info` (Slack)	Info	Non-urgent operational updates
Email distribution list	Critical	Escalation for outages
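The routing in the table above amounts to a lookup from severity to receivers. Alertmanager expresses this in its own routing configuration, which is not shown on this page; the Python sketch below only mirrors the table, and the receiver name `email-distribution-list` is a hypothetical stand-in for the actual list.

```python
from typing import Dict, List

# Severity -> receivers, mirroring the notification channel table.
ROUTES: Dict[str, List[str]] = {
    "critical": ["#cerebro-alerts", "email-distribution-list"],
    "warning": ["#cerebro-alerts"],
    "info": ["#cerebro-info"],
}


def receivers_for(severity: str) -> List[str]:
    """Return the channels an alert of the given severity is delivered to."""
    # Unknown severities get no receivers in this sketch; a real routing tree
    # would normally have a default receiver instead.
    return ROUTES.get(severity.lower(), [])


print(receivers_for("critical"))  # ['#cerebro-alerts', 'email-distribution-list']
print(receivers_for("info"))      # ['#cerebro-info']
```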

Health Checks

API Health Check

The cerebro-api exposes a health check at the root endpoint:

curl https://api.analytics.gnosis.io/

{
  "status": "online",
  "service": "Gnosis Cerebro Data API",
  "docs": "/docs"
}

This endpoint is used by:

Kubernetes readiness probe -- Determines if the pod should receive traffic
Kubernetes liveness probe -- Determines if the container should be restarted
ALB health check -- Determines if the target is healthy
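Kubernetes `httpGet` probes look only at the HTTP status code, not the body. An external check, such as a synthetic monitor, can be stricter and also validate the payload. Here is a minimal sketch of such a check, assuming the response shape shown above; the function name is hypothetical and the HTTP call itself is left out so the logic can be tested offline.

```python
import json


def is_healthy(status_code: int, body: str) -> bool:
    """Return True if the health response has status 200 and reports "online"."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        # A non-JSON body (for example an error page from a proxy) is unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("status") == "online"


sample = '{"status": "online", "service": "Gnosis Cerebro Data API", "docs": "/docs"}'
print(is_healthy(200, sample))                 # True
print(is_healthy(503, sample))                 # False
print(is_healthy(200, "<html>Bad Gateway</html>"))  # False
```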

Container Health Checks

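Each Docker image defines a HEALTHCHECK instruction in its Dockerfile. Kubernetes ignores this instruction and relies on the probes configured in the pod spec, so it takes effect only when the container runs directly under Docker: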

HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/ || exit 1

Kubernetes Probes

readinessProbe:
  httpGet:
    path: /
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3

livenessProbe:
  httpGet:
    path: /
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 5

Probe	Purpose	Failure Action
Readiness	Is the pod ready to serve traffic?	Remove from Service endpoints
Liveness	Is the pod still functioning?	Restart the container
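The probe settings above translate into rough reaction times. As an upper-bound estimate, a failure that starts just after a successful probe is acted on after about `failureThreshold × periodSeconds`. The sketch ignores `timeoutSeconds` and scheduling jitter, and the function name is made up for the example.

```python
def seconds_until_action(period_seconds: int, failure_threshold: int) -> int:
    """Approximate worst-case time from failure onset to the probe's failure action."""
    return period_seconds * failure_threshold


# Values taken from the probe configuration on this page.
readiness = seconds_until_action(period_seconds=15, failure_threshold=3)
liveness = seconds_until_action(period_seconds=30, failure_threshold=5)

print(readiness)  # 45: seconds before the pod is removed from Service endpoints
print(liveness)   # 150: seconds before the container is restarted
```

With these values an unresponsive pod stops receiving traffic well before it is restarted, which gives it a chance to recover without a restart.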

Dashboards

Key monitoring dashboards:

Dashboard	Metrics Displayed
API Overview	Request rate, error rate, latency distribution, endpoint counts
ClickHouse Performance	Query duration, rows returned, connection pool, error rates
Indexer Status	Current block, chain head, lag, processing rate
Crawler Activity	Peers discovered, crawl duration, IP enrichment rate
Infrastructure	CPU/memory usage, pod status, node health

Next Steps

Troubleshooting -- Use monitoring data to diagnose issues
Infrastructure -- Underlying platform architecture
Deployment -- Deployment procedures and configuration