Observability¶

Cerebro MCP exposes four observability surfaces: Prometheus metrics, structured JSON logs, reasoning traces, and a ready-to-import Grafana dashboard.

Grafana Dashboard¶

A complete Grafana dashboard is maintained at grafana/cerebro-mcp-observability.json in the Cerebro MCP repository.

Importing¶

1. Open your Grafana instance.
2. Go to Dashboards > Import.
3. Upload or paste the JSON file.
4. Configure the template variables to match your Prometheus and Loki datasources.

Dashboard Sections¶

The dashboard contains 9 row sections with 61 panels:

Overview¶

Kubernetes deployment health: desired/available replicas, active pods, total restarts, pod/node mapping, container images.

HTTP / SSE¶

HTTP transport metrics: request rate by path and status code, p95 latency by path, in-progress requests, 4xx/5xx error rates.

MCP Internals¶

MCP protocol metrics: request rate by method, p95 MCP latency, tool call rate by tool and status, p95 tool latency, top 10 failing tools.

Security Audit¶

Security detection metrics:

KPI stats: suspicious calls, high-risk tool calls, app-only calls, report auth denials
Trends: suspicious calls over time by flag type, high-risk calls by risk class (stacked area)
Breakdown: top high-risk tools by volume (bar gauge)
Live log: security audit events from Loki

See Security & Audit for details on the risk classification and detection system.

Tool Usage Details¶

Granular tool analytics:

Top 15 tools by volume (bar gauge)
Tool call distribution (pie chart)
Error rate by tool (timeseries)
Slowest tools p99 latency (table)

Semantic Layer¶

Semantic execution health:

State: semantic enabled flag, registry model/metric counts, snapshot age with yellow (>5 min) / red (>10 min) thresholds
Query flow: semantic query attempts by result (stacked), route decisions (pie chart)
Performance: end-to-end semantic latency p95, planner failures by reason
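
The snapshot-age thresholds above amount to a simple classification. A minimal sketch, assuming ages at or below 5 minutes render in the panel's base colour (called `green` here, which is a guess) and that the boundaries are exclusive, as the `>` signs suggest:

```python
def snapshot_age_status(age_seconds: float) -> str:
    """Map a semantic snapshot age to a threshold colour.

    The 5 and 10 minute limits come from the dashboard description;
    the "green" base state is an assumption for illustration.
    """
    if age_seconds > 600:  # older than 10 minutes
        return "red"
    if age_seconds > 300:  # older than 5 minutes
        return "yellow"
    return "green"
```

The input would typically be the current value of the `semantic_snapshot_age_seconds` gauge.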

ClickHouse¶

Database query metrics: query rate by database/status, p95 latency, error rate, p95 rows returned.

Pod Resources¶

Container resource usage: CPU usage/requests/limits, CPU throttling, memory usage/requests/limits, network RX/TX, pod restarts over time.

Logs¶

Structured log exploration from Loki:

Warnings/errors volume timeline
Live structured logs (all events)
Tool call logs (mcp_tool_call events)
MCP request logs (mcp_request events)
ClickHouse query logs
Failed events
Security audit log (security_audit events)
Artifact reload log (artifact_reload events)

Template Variables¶

Variable	Type	Default	Description
`$prometheus`	Datasource	`thanos-gnosisanalytics`	Prometheus datasource
`$loki`	Datasource	`loki-gnosisanalytics`	Loki datasource
`$cluster`	Query	(auto)	Kubernetes cluster
`$namespace`	Query	`analytics-preview`	Kubernetes namespace
`$workload`	Custom	`cerebro-mcp`	Deployment name
`$pod`	Query	All	Pod selector

Prometheus Metrics Reference¶

All metrics are exposed at the /metrics endpoint. The endpoint is unauthenticated: in SSE mode it is exempt from the bearer auth middleware.
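
Because the endpoint needs no token, it can be inspected with nothing more than the standard library. The sketch below is illustrative only: it assumes the server is reachable at a base URL you supply, and its parser covers the common `name{labels} value` lines of the Prometheus text format rather than every edge case. `fetch_metrics` and `parse_samples` are hypothetical helper names, not part of Cerebro MCP.

```python
import re
import urllib.request

# name, optional {labels}, then the sample value (a trailing timestamp is ignored)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Return (metric_name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def fetch_metrics(base_url, timeout=5.0):
    """Download the raw exposition text; base_url is supplied by the caller."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For anything beyond a quick check, a maintained parser such as the one in the `prometheus_client` package is a safer choice than hand-rolled regular expressions.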

HTTP Transport¶

Metric	Type	Labels
`cerebro_http_requests_total`	Counter	`method`, `path`, `status`
`cerebro_http_request_duration_seconds`	Histogram	`method`, `path`
`cerebro_http_requests_in_progress`	Gauge	`method`, `path`
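
Panels such as "p95 latency by path" are typically built from the histogram's `_bucket` series. The helpers below generate PromQL of that general shape; they are a sketch, not the queries stored in the dashboard JSON. The 5 minute rate window is an assumption, as is the `status` label holding numeric HTTP codes such as `500`.

```python
def p95_query(histogram: str, by: tuple = (), window: str = "5m") -> str:
    """Build a p95 PromQL expression for a Prometheus histogram metric."""
    # `le` must survive the aggregation for histogram_quantile to work.
    group = ", ".join(("le",) + tuple(by))
    return (
        "histogram_quantile(0.95, "
        f"sum by ({group}) (rate({histogram}_bucket[{window}])))"
    )


def server_error_rate_query(window: str = "5m") -> str:
    """Per-second rate of 5xx responses, assuming `status` is an HTTP code."""
    return f'sum(rate(cerebro_http_requests_total{{status=~"5.."}}[{window}]))'


print(p95_query("cerebro_http_request_duration_seconds", by=("path",)))
print(server_error_rate_query())
```

The same `p95_query` helper applies to the other histograms in this reference, for example `cerebro_mcp_tool_duration_seconds` grouped by `tool_name`.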

MCP Protocol¶

Metric	Type	Labels
`cerebro_mcp_requests_total`	Counter	`method`, `status`
`cerebro_mcp_request_duration_seconds`	Histogram	`method`
`cerebro_mcp_tool_calls_total`	Counter	`tool_name`, `status`
`cerebro_mcp_tool_duration_seconds`	Histogram	`tool_name`

ClickHouse¶

Metric	Type	Labels
`cerebro_clickhouse_query_duration_seconds`	Histogram	`database`, `audience`, `fetch_mode`, `status`
`cerebro_clickhouse_query_errors_total`	Counter	`database`, `audience`
`cerebro_clickhouse_rows_returned`	Histogram	`database`, `audience`

Security Audit¶

Metric	Type	Labels
`cerebro_security_high_risk_tool_calls_total`	Counter	`tool_name`, `risk_class`, `transport`
`cerebro_security_suspicious_calls_total`	Counter	`tool_name`, `flag_type`
`cerebro_security_app_only_calls_total`	Counter	`tool_name`, `transport`
`cerebro_report_token_auth_total`	Counter	`status`

Quality Gates (report enforcement)¶

Metric	Type	Labels
`cerebro_quality_gate_evaluations_total`	Counter	`gate_name`, `outcome` (`pass` / `fail`)
`cerebro_quality_report_generations_total`	Counter	`report_kind`, `outcome`
`cerebro_discovered_model_coverage_total`	Counter	`coverage_kind` (`queried` / `excluded` / `unused`)

These power the quality_metrics MCP tool's markdown summary. See Quality Gates for the full ruleset.
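
As an illustration of how the `outcome` label can be turned into a pass rate, the sketch below aggregates raw counter samples per gate. It is not the `quality_metrics` tool's implementation; the input shape (pairs of label dict and value) and the gate names in the example are hypothetical. Because counters are cumulative, the result is a rate over the process lifetime rather than over a time window.

```python
from collections import defaultdict


def gate_pass_rates(samples):
    """Compute the pass rate per gate.

    `samples` is an iterable of (labels, value) pairs taken from
    cerebro_quality_gate_evaluations_total.
    """
    totals = defaultdict(float)
    passes = defaultdict(float)
    for labels, value in samples:
        gate = labels["gate_name"]
        totals[gate] += value
        if labels["outcome"] == "pass":
            passes[gate] += value
    # Gates with no evaluations are left out to avoid dividing by zero.
    return {gate: passes[gate] / total for gate, total in totals.items() if total > 0}


example = [
    ({"gate_name": "example_gate", "outcome": "pass"}, 3.0),  # hypothetical gate name
    ({"gate_name": "example_gate", "outcome": "fail"}, 1.0),
]
print(gate_pass_rates(example))
```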

Workflows (Phase 3)¶

Metric	Type	Labels
`cerebro_workflow_resume_outcomes_total`	Counter	`kind`, `action` (`ready_to_resume` / `complete` / `failed` / `orphan` / `no_handler`)
`cerebro_workflow_event_appends_total`	Counter	`kind`, `event_kind`, `status`
`cerebro_workflow_event_payload_bytes`	Histogram	`kind`, `compressed`
`cerebro_workflow_orphan_sweep_seconds`	Histogram	(none)

Sandboxes (Phase 2)¶

Metric	Type	Labels
`cerebro_sandbox_create_total`	Counter	`outcome`
`cerebro_sandbox_query_total`	Counter	`outcome`
`cerebro_sandbox_export_bytes`	Histogram	(none)
`cerebro_sandbox_active`	Gauge	(none)

Semantic Layer¶

Metric	Type	Labels
`semantic_tool_calls_total`	Counter	`tool_name`, `status`, `agent_role`, `entrypoint`
`semantic_query_attempts_total`	Counter	`planner_mode`, `attempt`, `result`, `agent_role`
`semantic_query_repairs_total`	Counter	`repair_action`, `error_class`, `agent_role`
`semantic_planner_failures_total`	Counter	`reason`, `planner_mode`, `agent_role`
`semantic_fallback_total`	Counter	`fallback_target`, `reason`, `agent_role`
`semantic_route_total`	Counter	`route`, `mode`
`semantic_bypass_total`	Counter	`stage`, `reason`
`semantic_snapshot_reload_total`	Counter	`status`
`semantic_snapshot_stale_total`	Counter	`reason`
`semantic_planner_latency_seconds`	Histogram	`planner_mode`
`semantic_sql_compile_latency_seconds`	Histogram	`planner_mode`
`semantic_query_end_to_end_latency_seconds`	Histogram	`planner_mode`, `repair_state`
`semantic_snapshot_reload_latency_seconds`	Histogram	`status`
`semantic_snapshot_age_seconds`	Gauge	(none)
`semantic_registry_models_total`	Gauge	`semantic_status`
`semantic_registry_metrics_total`	Gauge	`quality_tier`
`semantic_registry_relationships_total`	Gauge	`quality_tier`
`semantic_semantic_enabled`	Gauge	`state`

Structured JSON Logging¶

All server logs are emitted as single-line JSON via JsonFormatter to stderr. A representative entry, pretty-printed here for readability:

{
  "timestamp": "2026-04-10T14:32:01.123456+00:00",
  "level": "INFO",
  "logger": "cerebro_mcp.tools.reasoning",
  "message": "mcp_tool_call",
  "event": "mcp_tool_call",
  "tool_name": "execute_query",
  "duration_ms": 142,
  "success": true
}
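
To make the log shape concrete, here is a minimal formatter that produces lines with the same top-level fields. It is a sketch under assumptions, not the project's JsonFormatter: the class name `SketchJsonFormatter` is hypothetical, and it assumes event fields arrive through the standard `extra=` argument of the logging call.

```python
import json
import logging
from datetime import datetime, timezone

# Attributes present on every LogRecord; anything else was passed via `extra=`.
_STANDARD_ATTRS = set(logging.makeLogRecord({}).__dict__) | {"message", "asctime"}


class SketchJsonFormatter(logging.Formatter):
    """Render each record as one compact JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        # Compact separators avoid spaces after ":" so that substring
        # filters such as "event":"mcp_tool_call" match the output.
        return json.dumps(payload, default=str, separators=(",", ":"))


handler = logging.StreamHandler()  # StreamHandler writes to stderr by default
handler.setFormatter(SketchJsonFormatter())
log = logging.getLogger("cerebro_mcp.tools.reasoning")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.info(
    "mcp_tool_call",
    extra={"event": "mcp_tool_call", "tool_name": "execute_query",
           "duration_ms": 142, "success": True},
)
```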

Event Types¶

Event	Source	When
`mcp_tool_call`	`tools/reasoning.py`	Every tool invocation
`mcp_request`	`tools/reasoning.py`	Low-level MCP protocol requests
`clickhouse_query`	`clickhouse_client.py`	ClickHouse query execution
`security_audit`	`security.py`	Suspicious tool calls (non-empty flags)
`report_token_auth`	`server.py`	Report endpoint access attempts
`artifact_reload`	`artifact_loader.py`	Artifact load/reload with hash and source
`transport_selected`	`server.py`	Server startup transport choice

Useful Loki Queries¶

All structured logs:

{namespace="$namespace", pod=~"cerebro-mcp-.*"} |= "\"timestamp\":\""

Tool calls:

{namespace="$namespace", pod=~"cerebro-mcp-.*"} |= "\"event\":\"mcp_tool_call\""

Security events:

{namespace="$namespace", pod=~"cerebro-mcp-.*"} |= "\"event\":\"security_audit\""

Artifact reloads:

{namespace="$namespace", pod=~"cerebro-mcp-.*"} |= "\"event\":\"artifact_reload\""

Failed events:

{namespace="$namespace", pod=~"cerebro-mcp-.*"} |= "\"success\":false"

Errors and warnings:

{namespace="$namespace", pod=~"cerebro-mcp-.*"} |~ "\"level\":\"(ERROR|WARNING)\""
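
When Loki is not available, for example when reading `kubectl logs` output or a local stderr capture, the same filters can be approximated on parsed JSON. The helpers below are a sketch with hypothetical names; unlike the substring filters above, they compare parsed field values, so they do not depend on the exact spacing of the serialized JSON.

```python
import json


def parse_lines(lines):
    """Yield parsed log entries, skipping lines that are not JSON objects."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(entry, dict):
            yield entry


def by_event(entries, event):
    """Entries for one event type, e.g. "mcp_tool_call" or "security_audit"."""
    return [e for e in entries if e.get("event") == event]


def failed_events(entries):
    """Entries that explicitly report success as false."""
    return [e for e in entries if e.get("success") is False]


def errors_and_warnings(entries):
    """Entries logged at ERROR or WARNING level."""
    return [e for e in entries if e.get("level") in ("ERROR", "WARNING")]
```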

Reasoning Traces¶

The reasoning/tracing system captures detailed tool execution traces for debugging and performance analysis. Traces are separate from the security JSONL audit log.

What is captured¶

Tool name, redacted arguments, and redacted result
Timing (start timestamp, duration)
Success/failure status with error details
Low-level MCP request/response pairs

Storage¶

Traces are persisted as session JSON files in THINKING_LOG_DIR (default .cerebro/logs/). Retention is controlled by THINKING_LOG_RETENTION_DAYS (default 30 days).
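
The server's cleanup logic is not documented here. As a rough picture of what age-based retention involves, the sketch below deletes trace files older than the retention window. It assumes session files end in `.json` and that age is judged by file modification time; both are guesses, and `prune_old_traces` is a hypothetical name.

```python
import time
from pathlib import Path


def prune_old_traces(log_dir, retention_days=30, now=None):
    """Delete *.json files in log_dir older than retention_days; return their paths."""
    current = time.time() if now is None else now
    cutoff = current - retention_days * 86400  # 86400 seconds per day
    removed = []
    for path in Path(log_dir).glob("*.json"):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Passing `now` explicitly makes the function easy to test without touching the system clock.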

Tools¶

Tool	Description
`set_thinking_mode(enabled)`	Start or stop a trace session
`log_reasoning(step, content)`	Add a manual reasoning step
`get_reasoning_log(session_id)`	Retrieve a full session trace
`get_performance_stats(last_n)`	Aggregate performance metrics across recent sessions

Configuration¶

Variable	Default	Description
`THINKING_MODE_ENABLED`	`True`	Enable the reasoning trace system
`THINKING_ALWAYS_ON`	`True`	Auto-capture tool calls without explicit `set_thinking_mode`
`THINKING_LOG_DIR`	`.cerebro/logs`	Trace storage directory
`THINKING_LOG_RETENTION_DAYS`	`30`	Automatic trace cleanup after N days
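
A minimal sketch of resolving these settings with the documented defaults, assuming they are read from environment variables. The boolean parsing rules and the `ThinkingConfig` and `load_thinking_config` names are illustrative and may not match how the server parses its configuration.

```python
import os
from dataclasses import dataclass
from pathlib import Path

_TRUTHY = {"1", "true", "yes", "on"}  # assumed spellings of "enabled"


def _env_bool(name, default, env):
    raw = env.get(name)
    return default if raw is None else raw.strip().lower() in _TRUTHY


@dataclass(frozen=True)
class ThinkingConfig:
    enabled: bool
    always_on: bool
    log_dir: Path
    retention_days: int


def load_thinking_config(env=None):
    """Read the reasoning-trace settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return ThinkingConfig(
        enabled=_env_bool("THINKING_MODE_ENABLED", True, env),
        always_on=_env_bool("THINKING_ALWAYS_ON", True, env),
        log_dir=Path(env.get("THINKING_LOG_DIR", ".cerebro/logs")),
        retention_days=int(env.get("THINKING_LOG_RETENTION_DAYS", "30")),
    )
```

Calling `load_thinking_config({})` returns the defaults from the table above, which is a quick way to confirm what a deployment would use when none of the variables are set.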