Repository: ColeMurray/claude-code-otel
Branch: main
Commit: 43263654a369
Files: 13
Total size: 78.3 KB

Directory structure:
gitextract_4qznzh4v/
├── .gitignore
├── CLAUDE_OBSERVABILITY.md
├── CONTRIBUTING.md
├── LICENSE
├── Makefile
├── README.md
├── claude-code-dashboard.json
├── collector-config.yaml
├── docker-compose-lgtm.yml
├── docker-compose.yml
├── grafana-dashboards.yml
├── grafana-datasources.yml
└── prometheus.yml

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Development artifacts
*.pyc
__pycache__/
.pytest_cache/
*.egg-info/
dist/
build/

# Environment files
.env
.env.local
.env.*.local

# IDE and editor files
.vscode/
.idea/
*.swp
*.swo
*~

# OS generated files
.DS_Store
.DS_Store?
._*
.Spotlight-V100
.Trashes
ehthumbs.db
Thumbs.db

# Python virtual environments
venv/
env/
ENV/

# Analysis outputs
analysis_results/
*.log

# Docker volumes (if persisted locally)
prometheus_data/
grafana_data/
loki_data/

# Temporary files
tmp/
temp/

================================================
FILE: CLAUDE_OBSERVABILITY.md
================================================
# Monitoring

> Learn how to enable and configure OpenTelemetry for Claude Code.

Claude Code supports OpenTelemetry (OTel) metrics and events for monitoring and observability.

All metrics are time series data exported via OpenTelemetry's standard metrics protocol, and events are exported via OpenTelemetry's logs/events protocol. It is the user's responsibility to ensure their metrics and logs backends are properly configured and that the aggregation granularity meets their monitoring requirements.

OpenTelemetry support is currently in beta and details are subject to change.

## Quick Start

Configure OpenTelemetry using environment variables:

```bash
# 1. Enable telemetry
export CLAUDE_CODE_ENABLE_TELEMETRY=1

# 2.
# Choose exporters (both are optional - configure only what you need)
export OTEL_METRICS_EXPORTER=otlp       # Options: otlp, prometheus, console
export OTEL_LOGS_EXPORTER=otlp          # Options: otlp, console

# 3. Configure OTLP endpoint (for OTLP exporter)
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# 4. Set authentication (if required)
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer your-token"

# 5. For debugging: reduce export intervals
export OTEL_METRIC_EXPORT_INTERVAL=10000  # 10 seconds (default: 60000ms)
export OTEL_LOGS_EXPORT_INTERVAL=5000     # 5 seconds (default: 5000ms)

# 6. Run Claude Code
claude
```

The default export intervals are 60 seconds for metrics and 5 seconds for logs. During setup, you may want to use shorter intervals for debugging purposes. Remember to reset these for production use.

For full configuration options, see the [OpenTelemetry specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/exporter.md#configuration-options).

## Administrator Configuration

Administrators can configure OpenTelemetry settings for all users through the managed settings file. This allows for centralized control of telemetry settings across an organization. See the [settings precedence](/en/docs/claude-code/settings#settings-precedence) for more information about how settings are applied.
The managed settings file is located at:

* macOS: `/Library/Application Support/ClaudeCode/managed-settings.json`
* Linux: `/etc/claude-code/managed-settings.json`

Example managed settings configuration:

```json
{
  "env": {
    "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
    "OTEL_METRICS_EXPORTER": "otlp",
    "OTEL_LOGS_EXPORTER": "otlp",
    "OTEL_EXPORTER_OTLP_PROTOCOL": "grpc",
    "OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.company.com:4317",
    "OTEL_EXPORTER_OTLP_HEADERS": "Authorization=Bearer company-token"
  }
}
```

Managed settings can be distributed via MDM (Mobile Device Management) or other device management solutions. Environment variables defined in the managed settings file have high precedence and cannot be overridden by users.

## Configuration Details

### Common Configuration Variables

| Environment Variable | Description | Example Values |
| --- | --- | --- |
| `CLAUDE_CODE_ENABLE_TELEMETRY` | Enables telemetry collection (required) | `1` |
| `OTEL_METRICS_EXPORTER` | Metrics exporter type(s) (comma-separated) | `console`, `otlp`, `prometheus` |
| `OTEL_LOGS_EXPORTER` | Logs/events exporter type(s) (comma-separated) | `console`, `otlp` |
| `OTEL_EXPORTER_OTLP_PROTOCOL` | Protocol for OTLP exporter (all signals) | `grpc`, `http/json`, `http/protobuf` |
| `OTEL_EXPORTER_OTLP_ENDPOINT` | OTLP collector endpoint (all signals) | `http://localhost:4317` |
| `OTEL_EXPORTER_OTLP_METRICS_PROTOCOL` | Protocol for metrics (overrides general) | `grpc`, `http/json`, `http/protobuf` |
| `OTEL_EXPORTER_OTLP_METRICS_ENDPOINT` | OTLP metrics endpoint (overrides general) | `http://localhost:4318/v1/metrics` |
| `OTEL_EXPORTER_OTLP_LOGS_PROTOCOL` | Protocol for logs (overrides general) | `grpc`, `http/json`, `http/protobuf` |
| `OTEL_EXPORTER_OTLP_LOGS_ENDPOINT` | OTLP logs endpoint (overrides general) | `http://localhost:4318/v1/logs` |
`OTEL_EXPORTER_OTLP_HEADERS` | Authentication headers for OTLP | `Authorization=Bearer token` |
| `OTEL_EXPORTER_OTLP_METRICS_CLIENT_KEY` | Client key for mTLS authentication | Path to client key file |
| `OTEL_EXPORTER_OTLP_METRICS_CLIENT_CERTIFICATE` | Client certificate for mTLS authentication | Path to client cert file |
| `OTEL_METRIC_EXPORT_INTERVAL` | Export interval in milliseconds (default: 60000) | `5000`, `60000` |
| `OTEL_LOGS_EXPORT_INTERVAL` | Logs export interval in milliseconds (default: 5000) | `1000`, `10000` |
| `OTEL_LOG_USER_PROMPTS` | Enable logging of user prompt content (default: disabled) | `1` to enable |

### Metrics Cardinality Control

The following environment variables control which attributes are included in metrics to manage cardinality:

| Environment Variable | Description | Default Value | Example to Disable |
| --- | --- | --- | --- |
| `OTEL_METRICS_INCLUDE_SESSION_ID` | Include session.id attribute in metrics | `true` | `false` |
| `OTEL_METRICS_INCLUDE_VERSION` | Include app.version attribute in metrics | `false` | `true` |
| `OTEL_METRICS_INCLUDE_ACCOUNT_UUID` | Include user.account\_uuid attribute in metrics | `true` | `false` |

These variables help control the cardinality of metrics, which affects storage requirements and query performance in your metrics backend. Lower cardinality generally means better performance and lower storage costs but less granular data for analysis.
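
To make the cardinality trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It estimates an upper bound on the number of time series for `claude_code.token.usage` under different attribute settings. The function name and the sample numbers (50 users, 20 sessions each, 3 models) are hypothetical, and the estimate assumes every session touches every model and every one of the four documented token types, which real usage rarely does.

```python
def estimate_token_usage_series(
    users: int,
    sessions_per_user: int,
    models: int,
    token_types: int = 4,  # input, output, cacheRead, cacheCreation
    *,
    include_session_id: bool = True,
    include_account_uuid: bool = True,
) -> int:
    """Upper bound on distinct series for one metric (hypothetical helper)."""
    if include_session_id:
        # Each session belongs to one account, so session.id already
        # separates users; the account attribute adds no further series.
        identities = users * sessions_per_user
    elif include_account_uuid:
        identities = users
    else:
        identities = 1
    return identities * models * token_types


if __name__ == "__main__":
    # Assumed team size and activity, purely for illustration.
    print(estimate_token_usage_series(50, 20, 3))
    print(estimate_token_usage_series(50, 20, 3, include_session_id=False))
```

Under these assumed numbers, dropping `session.id` shrinks the bound by the number of sessions per user, at the cost of losing per-session breakdowns.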

### Example Configurations

```bash
# Console debugging (1-second intervals)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=console
export OTEL_METRIC_EXPORT_INTERVAL=1000

# OTLP/gRPC
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Prometheus
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=prometheus

# Multiple exporters
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=console,otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/json

# Different endpoints/backends for metrics and logs
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf
export OTEL_EXPORTER_OTLP_METRICS_ENDPOINT=http://metrics.company.com:4318
export OTEL_EXPORTER_OTLP_LOGS_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_LOGS_ENDPOINT=http://logs.company.com:4317

# Metrics only (no events/logs)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# Events/logs only (no metrics)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```

## Available Metrics and Events

### Metrics

Claude Code exports the following metrics:

| Metric Name | Description | Unit |
| --- | --- | --- |
| `claude_code.session.count` | Count of CLI sessions started | count |
| `claude_code.lines_of_code.count` | Count of lines of code modified | count |
| `claude_code.pull_request.count` | Number of pull requests created | count |
| `claude_code.commit.count` | Number of git commits created | count |
| `claude_code.cost.usage` | Cost of the Claude Code
session | USD |
| `claude_code.token.usage` | Number of tokens used | tokens |
| `claude_code.code_edit_tool.decision` | Count of code editing tool permission decisions | count |

### Metric Details

All metrics share these standard attributes:

* `session.id`: Unique session identifier (controlled by `OTEL_METRICS_INCLUDE_SESSION_ID`)
* `app.version`: Current Claude Code version (controlled by `OTEL_METRICS_INCLUDE_VERSION`)
* `organization.id`: Organization UUID (when authenticated)
* `user.account_uuid`: Account UUID (when authenticated, controlled by `OTEL_METRICS_INCLUDE_ACCOUNT_UUID`)

#### Session Counter

Emitted at the start of each session.

#### Lines of Code Counter

Emitted when code is added or removed.

Additional attribute: `type` (`"added"` or `"removed"`)

#### Pull Request Counter

Emitted when creating pull requests via Claude Code.

#### Commit Counter

Emitted when creating git commits via Claude Code.

#### Cost Counter

Emitted after each API request.

Additional attribute: `model`

#### Token Counter

Emitted after each API request.

Additional attributes: `type` (`"input"`, `"output"`, `"cacheRead"`, `"cacheCreation"`) and `model`

#### Code Edit Tool Decision Counter

Emitted when the user accepts or rejects Edit, MultiEdit, Write, or NotebookEdit tool usage.

Additional attributes: `tool` (tool name: `"Edit"`, `"MultiEdit"`, `"Write"`, `"NotebookEdit"`) and `decision` (`"accept"`, `"reject"`)

### Events

Claude Code exports the following events via OpenTelemetry logs/events (when `OTEL_LOGS_EXPORTER` is configured):

#### User Prompt Event

* **Event Name**: `claude_code.user_prompt`
* **Description**: Logged when a user submits a prompt
* **Attributes**:
  * All standard attributes (user.id, session.id, etc.)
  * `event.name`: `"user_prompt"`
  * `event.timestamp`: ISO 8601 timestamp
  * `prompt_length`: Length of the prompt
  * `prompt`: Prompt content (redacted by default, enable with `OTEL_LOG_USER_PROMPTS=1`)

#### Tool Result Event

* **Event Name**: `claude_code.tool_result`
* **Description**: Logged when a tool completes execution
* **Attributes**:
  * All standard attributes
  * `event.name`: `"tool_result"`
  * `event.timestamp`: ISO 8601 timestamp
  * `name`: Name of the tool
  * `success`: `"true"` or `"false"`
  * `duration_ms`: Execution time in milliseconds
  * `error`: Error message (if failed)

#### API Request Event

* **Event Name**: `claude_code.api_request`
* **Description**: Logged for each API request to Claude
* **Attributes**:
  * All standard attributes
  * `event.name`: `"api_request"`
  * `event.timestamp`: ISO 8601 timestamp
  * `model`: Model used (e.g., "claude-3-5-sonnet-20241022")
  * `cost_usd`: Estimated cost in USD
  * `duration_ms`: Request duration in milliseconds
  * `input_tokens`: Number of input tokens
  * `output_tokens`: Number of output tokens
  * `cache_read_tokens`: Number of tokens read from cache
  * `cache_creation_tokens`: Number of tokens used for cache creation

#### API Error Event

* **Event Name**: `claude_code.api_error`
* **Description**: Logged when an API request to Claude fails
* **Attributes**:
  * All standard attributes
  * `event.name`: `"api_error"`
  * `event.timestamp`: ISO 8601 timestamp
  * `model`: Model used (e.g., "claude-3-5-sonnet-20241022")
  * `error`: Error message
  * `status_code`: HTTP status code (if applicable)
  * `duration_ms`: Request duration in milliseconds
  * `attempt`: Attempt number (for retried requests)

#### Tool Decision Event

* **Event Name**: `claude_code.tool_decision`
* **Description**: Logged when a tool permission decision is made (accept/reject)
* **Attributes**:
  * All standard attributes
  * `event.name`: `"tool_decision"`
  * `event.timestamp`: ISO 8601 timestamp
  * `tool_name`: Name of the tool (e.g., "Read", "Edit", "MultiEdit",
"Write", "NotebookEdit", etc.) * `decision`: Either `"accept"` or `"reject"` * `source`: Decision source - `"config"`, `"user_permanent"`, `"user_temporary"`, `"user_abort"`, or `"user_reject"` ## Interpreting Metrics and Events Data The metrics exported by Claude Code provide valuable insights into usage patterns and productivity. Here are some common visualizations and analyses you can create: ### Usage Monitoring | Metric | Analysis Opportunity | | ------------------------------------------------------------- | --------------------------------------------------------- | | `claude_code.token.usage` | Break down by `type` (input/output), user, team, or model | | `claude_code.session.count` | Track adoption and engagement over time | | `claude_code.lines_of_code.count` | Measure productivity by tracking code additions/removals | | `claude_code.commit.count` & `claude_code.pull_request.count` | Understand impact on development workflows | ### Cost Monitoring The `claude_code.cost.usage` metric helps with: * Tracking usage trends across teams or individuals * Identifying high-usage sessions for optimization Cost metrics are approximations. For official billing data, refer to your API provider (Anthropic Console, AWS Bedrock, or Google Cloud Vertex). ### Alerting and Segmentation Common alerts to consider: * Cost spikes * Unusual token consumption * High session volume from specific users All metrics can be segmented by `user.account_uuid`, `organization.id`, `session.id`, `model`, and `app.version`. ### Event Analysis The event data provides detailed insights into Claude Code interactions: **Tool Usage Patterns**: Analyze tool result events to identify: * Most frequently used tools * Tool success rates * Average tool execution times * Error patterns by tool type **Performance Monitoring**: Track API request durations and tool execution times to identify performance bottlenecks. 

## Backend Considerations

Your choice of metrics and logs backends will determine the types of analyses you can perform:

### For Metrics:

* **Time series databases (e.g., Prometheus)**: Rate calculations, aggregated metrics
* **Columnar stores (e.g., ClickHouse)**: Complex queries, unique user analysis
* **Full-featured observability platforms (e.g., Honeycomb, Datadog)**: Advanced querying, visualization, alerting

### For Events/Logs:

* **Log aggregation systems (e.g., Elasticsearch, Loki)**: Full-text search, log analysis
* **Columnar stores (e.g., ClickHouse)**: Structured event analysis
* **Full-featured observability platforms (e.g., Honeycomb, Datadog)**: Correlation between metrics and events

For organizations requiring Daily/Weekly/Monthly Active User (DAU/WAU/MAU) metrics, consider backends that support efficient unique value queries.

## Service Information

All metrics are exported with:

* Service Name: `claude-code`
* Service Version: Current Claude Code version
* Meter Name: `com.anthropic.claude_code`

## Security/Privacy Considerations

* Telemetry is opt-in and requires explicit configuration
* Sensitive information such as API keys or file contents is never included in metrics or events
* User prompt content is redacted by default - only prompt length is recorded. To enable user prompt logging, set `OTEL_LOG_USER_PROMPTS=1`

================================================
FILE: CONTRIBUTING.md
================================================
# Contributing to Claude Code Observability Stack

Thank you for your interest in contributing to the Claude Code Observability Stack! This project helps developers monitor and analyze their Claude Code usage through comprehensive dashboards and metrics.

## 🚀 Getting Started

### Prerequisites

- Docker and Docker Compose
- Basic understanding of OpenTelemetry, Prometheus, and Grafana
- Familiarity with Claude Code and its telemetry features

### Development Setup

1. Fork the repository
2.
Clone your fork: `git clone https://github.com/your-username/claude-code-otel.git`
   - Original repository: `git clone https://github.com/ColeMurray/claude-code-otel.git`
3. Start the development stack: `make up`
4. Access Grafana at http://localhost:3000 (admin/admin)

## 📊 How to Contribute

### Dashboard Improvements

- Add new panels for additional Claude Code metrics
- Improve existing visualizations for better insights
- Optimize queries for better performance
- Enhance color schemes and layouts

### Configuration Enhancements

- Improve OpenTelemetry collector configurations
- Add new Prometheus recording rules
- Optimize data retention and storage

### Documentation

- Improve setup instructions
- Add troubleshooting guides
- Create usage examples
- Update metric documentation

## 🔍 Project Structure

```
├── claude-code-dashboard.json    # Main Grafana dashboard
├── collector-config.yaml         # OpenTelemetry collector config
├── docker-compose.yml            # Main stack configuration
├── prometheus.yml                # Prometheus configuration
├── grafana-*.yml                 # Grafana configuration files
├── Makefile                      # Management commands
└── README.md                     # Project documentation
```

## 📋 Contribution Guidelines

### Code Quality

1. **Follow OpenTelemetry Standards**: Use proper metric naming conventions as defined in the Claude Code documentation
2. **Test Your Changes**: Verify that new configurations work with actual Claude Code telemetry data
3. **Documentation**: Update relevant documentation for any changes
4. **Security**: Never commit sensitive information (API keys, credentials, etc.)

### Dashboard Guidelines

1. **Consistent Styling**: Follow the existing color schemes and layout patterns
2. **Performance**: Optimize queries to avoid excessive resource usage
3. **Accessibility**: Use clear labels and legends for all visualizations
4.
**Mobile-Friendly**: Ensure dashboards work well on different screen sizes

### Commit Messages

Use clear, descriptive commit messages:

- `feat: add API request count panel to cost analysis`
- `fix: correct token usage query in performance dashboard`
- `docs: update setup instructions for macOS`
- `refactor: optimize Prometheus queries for better performance`

## 🧪 Testing Your Changes

### Before Submitting a PR

1. **Start the Stack**: `make up`
2. **Generate Test Data**: Use Claude Code with telemetry enabled
3. **Verify Dashboards**: Check that all panels display data correctly
4. **Test Configuration**: Ensure `make validate-config` passes
5. **Check Documentation**: Verify all links and instructions work

### Configuration Validation

```bash
# Validate all configurations
make validate-config

# Test individual components
docker compose config  # Validate docker-compose.yml
curl -f http://localhost:9090/-/healthy  # Test Prometheus
curl -f http://localhost:3000/api/health  # Test Grafana
```

## 📈 Types of Contributions

### Dashboard Enhancements

- **New Metric Panels**: Add visualizations for additional Claude Code metrics
- **Improved Layouts**: Better organization of dashboard sections
- **Enhanced Queries**: More efficient or insightful Prometheus/LogQL queries
- **Visual Improvements**: Better color schemes, legends, and formatting

### Infrastructure Improvements

- **Performance Optimization**: Faster queries and reduced resource usage
- **Configuration Management**: Better default configurations
- **Documentation**: Clearer setup and troubleshooting guides
- **Compatibility**: Support for different environments and setups

### Feature Additions

- **New Metrics**: Support for additional Claude Code telemetry data
- **Export Options**: Additional data export formats or integrations
- **Monitoring Enhancements**: Better health checks and status monitoring

## 🐛 Reporting Issues

When reporting issues, please include:

1.
**Environment Details**: OS, Docker version, Claude Code version
2. **Steps to Reproduce**: Clear instructions to replicate the issue
3. **Expected vs Actual**: What you expected to happen vs what actually happened
4. **Logs**: Relevant logs from the observability stack
5. **Configuration**: Any custom configurations you're using

## 📚 Resources

- [Claude Code Observability Documentation](CLAUDE_OBSERVABILITY.md)
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/)
- [Prometheus Query Language](https://prometheus.io/docs/prometheus/latest/querying/)
- [Grafana Dashboard Best Practices](https://grafana.com/docs/grafana/latest/best-practices/)
- [LogQL Documentation](https://grafana.com/docs/loki/latest/logql/)

## 💬 Getting Help

- **Issues**: Open a GitHub issue for bugs or feature requests
- **Discussions**: Use GitHub Discussions for questions and ideas
- **Documentation**: Check the README and CLAUDE_OBSERVABILITY.md first

## 📋 Pull Request Process

1. **Fork & Branch**: Create a feature branch from `main`
2. **Develop**: Make your changes following the guidelines above
3. **Test**: Verify your changes work as expected
4. **Document**: Update documentation if needed
5. **Submit**: Create a pull request with a clear description

### PR Description Template

```markdown
## Summary
Brief description of changes

## Type of Change
- [ ] Dashboard improvement
- [ ] Configuration enhancement
- [ ] Documentation update
- [ ] Bug fix
- [ ] New feature

## Testing
- [ ] Tested with actual Claude Code telemetry data
- [ ] Verified all dashboard panels work correctly
- [ ] Configuration validation passes
- [ ] Documentation is accurate

## Screenshots (if applicable)
Include screenshots of dashboard changes
```

Thank you for contributing to the Claude Code Observability Stack!
🚀

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2024 Claude Code Observability Stack

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================
FILE: Makefile
================================================
# Claude Code Observability Stack

.PHONY: help up down logs restart clean validate-config

help: ## Show this help message
	@echo "Claude Code Observability Stack"
	@echo "================================"
	@echo ""
	@echo "Available commands:"
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-20s\033[0m %s\n", $$1, $$2}'

up: ## Start the observability stack
	@echo "🚀 Starting Claude Code observability stack..."
	docker compose up -d
	@echo "✅ Stack started!"
	@echo "📊 Grafana: http://localhost:3000 (admin/admin)"
	@echo "🔍 Prometheus: http://localhost:9090"
	@echo "📄 Loki: http://localhost:3100"

down: ## Stop the observability stack
	@echo "🛑 Stopping Claude Code observability stack..."
	docker compose down
	@echo "✅ Stack stopped!"

restart: ## Restart the observability stack
	@echo "🔄 Restarting Claude Code observability stack..."
	docker compose restart
	@echo "✅ Stack restarted!"

logs: ## Show logs from all services
	docker compose logs -f

logs-collector: ## Show OpenTelemetry collector logs
	docker compose logs -f otel-collector

logs-prometheus: ## Show Prometheus logs
	docker compose logs -f prometheus

logs-grafana: ## Show Grafana logs
	docker compose logs -f grafana

clean: ## Clean up containers and volumes
	@echo "🧹 Cleaning up..."
	docker compose down -v
	docker system prune -f
	@echo "✅ Cleanup complete!"

validate-config: ## Validate all configuration files
	@echo "✅ Validating configurations..."
	@echo "📋 Checking docker-compose.yml..."
	docker compose config > /dev/null && echo "✅ docker-compose.yml is valid"
	@echo "📋 Checking collector-config.yaml..."
	@if command -v otelcol-contrib >/dev/null 2>&1; then \
		otelcol-contrib validate --config=collector-config.yaml; \
	else \
		echo "ℹ️  Install otelcol-contrib to validate collector config"; \
	fi

status: ## Show stack status
	@echo "📊 Claude Code Observability Stack Status"
	@echo "==========================================="
	@docker compose ps
	@echo ""
	@echo "🌐 Service URLs:"
	@echo "  Grafana: http://localhost:3000"
	@echo "  Prometheus: http://localhost:9090"
	@echo "  Loki: http://localhost:3100"
	@echo "  Collector: http://localhost:4317 (gRPC), http://localhost:4318 (HTTP)"

setup-claude: ## Display Claude Code telemetry setup instructions
	@echo "🤖 Claude Code Telemetry Setup"
	@echo "==============================="
	@echo ""
	@echo "To enable telemetry in Claude Code, set these environment variables:"
	@echo ""
	@echo "export CLAUDE_CODE_ENABLE_TELEMETRY=1"
	@echo "export OTEL_METRICS_EXPORTER=otlp"
	@echo "export OTEL_LOGS_EXPORTER=otlp"
	@echo "export OTEL_EXPORTER_OTLP_PROTOCOL=grpc"
	@echo "export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317"
	@echo ""
	@echo "For debugging (faster export intervals):"
	@echo "export OTEL_METRIC_EXPORT_INTERVAL=10000"
	@echo "export OTEL_LOGS_EXPORT_INTERVAL=5000"
	@echo ""
	@echo "Then run: claude"

demo-metrics: ## Generate demo metrics for testing
	@echo "🎯 This would generate demo metrics if Claude Code was running"
	@echo "💡 To see real metrics, ensure Claude Code is configured with telemetry enabled"
	@echo "📖 Run 'make setup-claude' for setup instructions"

================================================
FILE: README.md
================================================
# Claude Code Observability Stack

[![GitHub](https://img.shields.io/badge/GitHub-ColeMurray%2Fclaude--code--otel-blue?logo=github)](https://github.com/ColeMurray/claude-code-otel)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE)
[![Docker](https://img.shields.io/badge/Docker-Ready-blue?logo=docker)](docker-compose.yml)

A comprehensive
observability solution for monitoring Claude Code usage, performance, and costs. This setup implements the recommendations from the [Claude Code Observability Documentation](CLAUDE_OBSERVABILITY.md) to provide deep insights into AI-assisted development workflows.

## 📸 Dashboard Screenshots

### 💰 Cost & Usage Analysis

Track spending across different Claude models with detailed breakdowns of costs, API requests, and token usage patterns.

Cost & Usage Analysis Dashboard

*Features: Model cost comparison, API request tracking, token usage breakdown by type*

### 📊 User Activity & Productivity

Monitor development productivity with comprehensive session analytics, tool usage patterns, and code change metrics.

User Activity & Productivity Dashboard

*Features: Session tracking, tool performance metrics, code productivity insights*

## 🎯 Features

### 📊 **Comprehensive Monitoring**

- **Cost Analysis**: Track usage costs by model, user, and time periods
- **User Analytics**: Daily/Weekly/Monthly Active Users (DAU/WAU/MAU)
- **Tool Usage**: Monitor which Claude Code tools are used most frequently
- **Performance Metrics**: API latency, success rates, and bottleneck identification
- **Productivity Insights**: Lines of code changes, commits, and pull requests

### 📊 **Enhanced Analytics**

- **API Request Tracking**: Monitor actual request counts by model version
- **Token Efficiency**: Track cost-per-token across different models
- **Session Analytics**: Comprehensive session and productivity tracking
- **Real-time Monitoring**: Live dashboards with 30-second refresh rates

### 📈 **Rich Dashboards**

- **Executive Overview**: High-level KPIs and trends
- **Cost Management**: Detailed cost breakdowns and projections
- **Tool Performance**: Success rates and execution times
- **User Activity**: Productivity and engagement metrics
- **Error Analysis**: Comprehensive error tracking and investigation

## 🏗️ Architecture

```
Claude Code → OpenTelemetry Collector → Prometheus (metrics) + Loki
(events/logs)
                                              ↓
                                    Grafana (visualization & analysis)
```

### Components

| Service | Purpose | Port | UI |
|---------|---------|------|----|
| **OpenTelemetry Collector** | Metrics/logs ingestion | 4317 (gRPC), 4318 (HTTP) | - |
| **Prometheus** | Metrics storage & querying | 9090 | http://localhost:9090 |
| **Loki** | Log aggregation & storage | 3100 | - |
| **Grafana** | Dashboards & visualization | 3000 | http://localhost:3000 |

## 🚀 Quick Start

### 1. Start the Stack

```bash
# Start all services
make up

# Check status
make status
```

### 2. Configure Claude Code

```bash
# Enable telemetry
export CLAUDE_CODE_ENABLE_TELEMETRY=1

# Configure exporters
export OTEL_METRICS_EXPORTER=otlp
export OTEL_LOGS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=grpc
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

# For debugging (faster export intervals)
export OTEL_METRIC_EXPORT_INTERVAL=10000
export OTEL_LOGS_EXPORT_INTERVAL=5000

# Run Claude Code
claude
```

### 3. Access Dashboards

- **Grafana**: http://localhost:3000 (admin/admin)
- **Prometheus**: http://localhost:9090

> 🖼️ **Visual Guide**: Check out the [Dashboard Screenshots](#-dashboard-screenshots) to see what your dashboards will look like!
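
If no data shows up after the steps above, a quick preflight check of the environment can help. The sketch below is a hypothetical helper, not something shipped in this repository. It inspects only the general variables from step 2 (not the per-signal overrides) and flags an OTLP endpoint whose port does not match this stack's convention of 4317 for gRPC and 4318 for HTTP. A deployment that maps ports differently would need the table adjusted.

```python
import os
from urllib.parse import urlparse

# Ports used by the collector in this stack; treat as an assumption elsewhere.
CONVENTIONAL_PORTS = {"grpc": 4317, "http/protobuf": 4318, "http/json": 4318}


def check_telemetry_env(env):
    """Return a list of human-readable problems found in `env` (a mapping)."""
    problems = []
    if env.get("CLAUDE_CODE_ENABLE_TELEMETRY") != "1":
        problems.append("CLAUDE_CODE_ENABLE_TELEMETRY is not set to 1")

    exporters = set()
    for key in ("OTEL_METRICS_EXPORTER", "OTEL_LOGS_EXPORTER"):
        exporters.update(v.strip() for v in env.get(key, "").split(",") if v.strip())
    if not exporters:
        problems.append("neither OTEL_METRICS_EXPORTER nor OTEL_LOGS_EXPORTER is set")

    if "otlp" in exporters:
        endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "")
        protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "")
        expected = CONVENTIONAL_PORTS.get(protocol)
        if not endpoint:
            problems.append("OTEL_EXPORTER_OTLP_ENDPOINT is not set")
        elif expected is not None and urlparse(endpoint).port != expected:
            problems.append(
                f"endpoint {endpoint} does not use port {expected}, "
                f"the usual port for protocol {protocol}"
            )
    return problems


if __name__ == "__main__":
    for problem in check_telemetry_env(os.environ):
        print("warning:", problem)
```

Run it in the same shell that will launch `claude`, so it sees the exported variables.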

## 📊 Available Metrics

Based on the [Claude Code Observability Documentation](CLAUDE_OBSERVABILITY.md), this stack monitors:

### Core Metrics

- `claude_code.session.count` - CLI sessions started
- `claude_code.lines_of_code.count` - Lines of code modified (added/removed)
- `claude_code.pull_request.count` - Pull requests created
- `claude_code.commit.count` - Git commits created
- `claude_code.cost.usage` - Cost of sessions by model
- `claude_code.token.usage` - Token usage (input/output/cacheRead/cacheCreation)
- `claude_code.code_edit_tool.decision` - Tool permission decisions

### Event Data

- `claude_code.user_prompt` - User prompt submissions
- `claude_code.tool_result` - Tool execution results and timings
- `claude_code.api_request` - API requests with duration and tokens
- `claude_code.api_error` - API errors with status codes
- `claude_code.tool_decision` - Tool permission decisions

## 🔍 Usage Analysis

### Real-time Dashboard Analysis

Access comprehensive analytics through the Grafana dashboard at http://localhost:3000:

- **Cost Analysis**: Real-time cost tracking with model breakdowns
- **Request Monitoring**: API request counts and patterns by model
- **Token Efficiency**: Track token usage and cost-per-token metrics
- **Tool Performance**: Success rates and execution time analysis
- **Session Analytics**: User activity and productivity insights

### Key Metrics Available

- Total and per-model costs with trending
- API request counts independent of cost variations
- Token usage breakdown (input/output/cacheRead/cacheCreation)
- Tool usage patterns and success rates
- Session activity and code productivity metrics

## 📊 Key Dashboard Features

> 💡 **See [Dashboard Screenshots](#-dashboard-screenshots) above for visual examples**

### 💰 Cost & Usage Analysis

- **Cost by Model**: Track spending across different Claude models
- **API Request Tracking**: Monitor actual request counts by model version
- **Token Usage Breakdown**: Detailed analysis by token type (input/output/cacheRead/cacheCreation)
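
For readers who want to reproduce these cost panels outside Grafana, the following Python sketch shows one way the per-model numbers could be derived from `claude_code.api_request` events. It assumes the events are already available as dictionaries with the documented attribute names. The function name, the choice to count cache tokens in the total, and the sample values are all assumptions made for illustration, and the dashboard's own queries may compute these figures differently.

```python
from collections import defaultdict

TOKEN_FIELDS = (
    "input_tokens",
    "output_tokens",
    "cache_read_tokens",
    "cache_creation_tokens",
)


def cost_by_model(events):
    """Per-model request count, total cost, and cost per 1,000 tokens."""
    totals = defaultdict(lambda: {"requests": 0, "cost_usd": 0.0, "tokens": 0})
    for event in events:
        entry = totals[event["model"]]
        entry["requests"] += 1
        entry["cost_usd"] += float(event["cost_usd"])
        # Missing token fields are treated as zero.
        entry["tokens"] += sum(int(event.get(field, 0)) for field in TOKEN_FIELDS)

    report = {}
    for model, entry in totals.items():
        per_1k = entry["cost_usd"] * 1000 / entry["tokens"] if entry["tokens"] else None
        report[model] = {**entry, "cost_per_1k_tokens": per_1k}
    return report


if __name__ == "__main__":
    # Made-up events; "model-a" is a placeholder, not a real model name.
    sample = [
        {"model": "model-a", "cost_usd": "0.25", "input_tokens": 800, "output_tokens": 200},
        {"model": "model-a", "cost_usd": "0.75", "input_tokens": 600, "output_tokens": 400},
    ]
    print(cost_by_model(sample))
```

Because `claude_code.cost.usage` is documented as an approximation, figures computed this way are best used for comparing models and spotting trends rather than reconciling invoices.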
### 🔧 Tool Performance

- **Usage Patterns**: Most frequently used Claude Code tools
- **Success Rates**: Tool execution success percentages
- **Performance Metrics**: Average execution times and bottleneck identification

### ⚡ Real-time Monitoring

- **Live Metrics**: 30-second refresh rate for current activity
- **Session Tracking**: Active sessions and productivity metrics
- **Error Analysis**: API errors and troubleshooting information

## 📋 Dashboard Sections

The Grafana dashboard is organized into sections reflecting the observability documentation recommendations:

### 📊 Overview

- Active sessions, cost, token usage, lines of code changed

### 💰 Cost & Usage Analysis

- Cost trends by model, token usage breakdown
- **NEW**: API request count tracking by model version
- Implements cost monitoring recommendations

### 🔧 Tool Usage & Performance

- Tool frequency and success rates
- Performance bottleneck identification

### ⚡ Performance & Errors

- API latency by model, error rate tracking
- Performance monitoring as recommended

### 📝 User Activity & Productivity

- Code changes, commits, pull requests
- Productivity measurement insights

### 🔍 Event Logs

- Real-time tool execution events and API errors
- Structured log analysis for troubleshooting

## 🔧 Advanced Configuration

### Environment Variables

Key configuration options (see [CLAUDE_OBSERVABILITY.md](CLAUDE_OBSERVABILITY.md) for complete reference):

```bash # Core telemetry CLAUDE_CODE_ENABLE_TELEMETRY=1 # Exporter configuration OTEL_METRICS_EXPORTER=otlp,prometheus # Multiple exporters OTEL_LOGS_EXPORTER=otlp # Protocol and endpoints OTEL_EXPORTER_OTLP_PROTOCOL=grpc OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer token" # Export intervals OTEL_METRIC_EXPORT_INTERVAL=60000 # 1 minute (production) OTEL_LOGS_EXPORT_INTERVAL=5000 # 5 seconds # Privacy controls OTEL_LOG_USER_PROMPTS=1 # Enable prompt content logging # Cardinality control
OTEL_METRICS_INCLUDE_SESSION_ID=true OTEL_METRICS_INCLUDE_VERSION=false OTEL_METRICS_INCLUDE_ACCOUNT_UUID=true ```

### Collector Configuration

The OpenTelemetry collector (`collector-config.yaml`) is configured with:

- **Receivers**: OTLP over gRPC (port 4317) and HTTP (port 4318)
- **Processors**: A `resource` processor that sets `environment=production` on all telemetry
- **Pipelines**: Metrics are exposed to Prometheus on port 8889; logs are forwarded to Loki over OTLP HTTP; both pipelines also write to the `debug` exporter

### Backend Considerations

Following the documentation recommendations:

- **Metrics Backend**: Prometheus (time series) + optional columnar stores
- **Events Backend**: Loki (log aggregation) with JSON parsing
- **Cardinality Management**: Configurable attribute inclusion
- **Retention**: Configure based on your analysis needs

## 🛠️ Management Commands

```bash
# Stack management
make up               # Start all services
make down             # Stop all services
make restart          # Restart services
make clean            # Clean up containers and volumes

# Monitoring
make logs             # View all logs
make logs-collector   # View collector logs only
make status           # Show service status

# Validation
make validate-config  # Validate all configs
make setup-claude     # Show Claude Code setup instructions
```

## 🎯 Use Cases

### For Engineering Teams

- **Cost Management**: Track AI assistance costs by team/project
- **Productivity Measurement**: Quantify development velocity improvements
- **Tool Adoption**: Understand which Claude Code features drive value
- **Performance Optimization**: Identify and resolve usage bottlenecks

### For Platform Teams

- **Capacity Planning**: Predict infrastructure needs based on usage growth
- **SLA Monitoring**: Track API performance and availability
- **Security**: Monitor unusual usage patterns
- **Resource Optimization**: Optimize token usage and reduce costs

### For Management

- **ROI Analysis**: Measure productivity gains from AI assistance
- **Usage Insights**: Understand adoption patterns across teams
- **Cost Control**: Monitor and optimize AI assistance spending
- **Strategic Planning**:
Data-driven decisions on AI tool investments

## 🔒 Security & Privacy

- **User Privacy**: Prompt content logging is disabled by default
- **Data Isolation**: All data stays within your infrastructure
- **Access Control**: Configure Grafana authentication as needed
- **Audit Trail**: Complete logging of all tool usage and decisions

## 📚 Resources

- [Claude Code Observability Documentation](CLAUDE_OBSERVABILITY.md) - Complete reference
- [OpenTelemetry Documentation](https://opentelemetry.io/docs/) - OTel specification
- [Prometheus Documentation](https://prometheus.io/docs/) - Metrics and alerting
- [Grafana Documentation](https://grafana.com/docs/) - Dashboards and visualization
- [Loki Documentation](https://grafana.com/docs/loki/) - Log aggregation

## 🤝 Contributing

This observability stack implements the patterns and recommendations from the official Claude Code documentation. To contribute:

1. Follow the metric naming conventions in the documentation
2. Update dashboards to reflect new data sources and metrics
3. Test configurations before submitting changes
4. Ensure all sensitive information is excluded from commits
5. Update documentation for any new features or configuration changes

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## 🙏 Acknowledgments

- Built following the [Claude Code Observability Documentation](CLAUDE_OBSERVABILITY.md)
- Uses OpenTelemetry standards for metrics and events
- Implements industry best practices for observability stack architecture

================================================
FILE: claude-code-dashboard.json
================================================
{ "annotations": { "list": [ { "builtIn": 1, "datasource": { "type": "grafana", "uid": "-- Grafana --" }, "enable": true, "hide": true, "iconColor": "rgba(0, 211, 255, 1)", "name": "Annotations & Alerts", "type": "dashboard" } ] }, "description": "Claude Code Observability Dashboard - Monitor usage, costs, and performance", "editable": true, "fiscalYearStartMonth": 0, "graphTooltip": 0, "id": null, "links": [], "liveNow": false, "panels": [ { "title": "📊 Overview", "type": "row", "gridPos": {"h": 1, "w": 24, "x": 0, "y": 0}, "collapsed": false }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 10 }, { "color": "red", "value": 50 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 4, "w": 6, "x": 0, "y": 1 }, "id": 1, "options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "orientation": "auto", "reduceOptions": { "values": false, "calcs": ["lastNotNull"], "fields": "" }, "textMode": "auto" }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum(increase(claude_code_session_count_total{job=\"otel-collector\"}[1h]))", "interval": "", "legendFormat": "Sessions (1h)", "refId": "A" } ], "title": "Active Sessions (1h)", "type": "stat" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [
{ "color": "green", "value": null }, { "color": "yellow", "value": 5 }, { "color": "red", "value": 20 } ] }, "unit": "currencyUSD" }, "overrides": [] }, "gridPos": { "h": 4, "w": 6, "x": 6, "y": 1 }, "id": 2, "options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "orientation": "auto", "reduceOptions": { "values": false, "calcs": ["lastNotNull"], "fields": "" }, "textMode": "auto" }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum(increase(claude_code_cost_usage_USD_total{job=\"otel-collector\"}[1h]))", "interval": "", "legendFormat": "Cost (1h)", "refId": "A" } ], "title": "Cost (Last Hour)", "type": "stat" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 10000 }, { "color": "red", "value": 50000 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 4, "w": 6, "x": 12, "y": 1 }, "id": 3, "options": { "colorMode": "background", "graphMode": "area", "justifyMode": "center", "orientation": "auto", "reduceOptions": { "values": false, "calcs": ["lastNotNull"], "fields": "" }, "textMode": "auto" }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum(increase(claude_code_token_usage_tokens_total{job=\"otel-collector\"}[1h]))", "interval": "", "legendFormat": "Tokens (1h)", "refId": "A" } ], "title": "Token Usage (1h)", "type": "stat" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "thresholds" }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 100 }, { "color": "red", "value": 500 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 4, "w": 6, "x": 18, "y": 1 }, "id": 4, "options": { 
"colorMode": "background", "graphMode": "area", "justifyMode": "center", "orientation": "auto", "reduceOptions": { "values": false, "calcs": ["lastNotNull"], "fields": "" }, "textMode": "auto" }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum(increase(claude_code_lines_of_code_count_total{job=\"otel-collector\"}[1h]))", "interval": "", "legendFormat": "Lines Changed (1h)", "refId": "A" } ], "title": "Lines of Code (1h)", "type": "stat" }, { "title": "💰 Cost & Usage Analysis", "type": "row", "gridPos": {"h": 1, "w": 24, "x": 0, "y": 5}, "collapsed": false }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "USD", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 2, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "currencyUSD" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 6 }, "id": 5, "options": { "legend": { "calcs": ["max", "mean"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum by (model) (increase(claude_code_cost_usage_USD_total{job=\"otel-collector\"}[1h]))", "interval": "", "legendFormat": "{{model}}", "refId": "A" } ], "title": "Cost by Model (Hourly)", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, 
"custom": { "axisLabel": "Tokens (log scale)", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 2, "pointSize": 5, "scaleDistribution": { "type": "log", "log": 10 }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [ { "matcher": { "id": "byName", "options": "cacheRead" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "blue" } }, { "id": "custom.lineWidth", "value": 3 } ] }, { "matcher": { "id": "byName", "options": "output" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "green" } }, { "id": "custom.lineWidth", "value": 3 } ] }, { "matcher": { "id": "byName", "options": "input" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "orange" } }, { "id": "custom.lineWidth", "value": 2 } ] }, { "matcher": { "id": "byName", "options": "cacheCreation" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "purple" } }, { "id": "custom.lineWidth", "value": 2 } ] } ] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 6 }, "id": 6, "options": { "legend": { "calcs": ["max", "mean"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum by (type) (rate(claude_code_token_usage_tokens_total{job=\"otel-collector\"}[5m]) * 60)", "interval": "", "legendFormat": "{{type}}", "refId": "A" } ], "title": "Token Usage Rate by Type", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { 
"defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "Requests", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 2, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [ { "matcher": { "id": "byRegexp", "options": ".*sonnet.*" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "purple" } }, { "id": "custom.lineWidth", "value": 3 } ] }, { "matcher": { "id": "byRegexp", "options": ".*haiku.*" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "blue" } }, { "id": "custom.lineWidth", "value": 3 } ] } ] }, "gridPos": { "h": 8, "w": 24, "x": 0, "y": 14 }, "id": 15, "options": { "legend": { "calcs": ["lastNotNull", "max", "mean"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum by (model) (changes(claude_code_cost_usage_USD_total{job=\"otel-collector\"}[5m]))", "interval": "", "legendFormat": "{{model}}", "refId": "A" } ], "title": "API Requests by Model (5min rate)", "type": "timeseries" }, { "title": "🔧 Tool Usage & Performance", "type": "row", "gridPos": {"h": 1, "w": 24, "x": 0, "y": 22}, "collapsed": false }, { "datasource": { "type": "loki", "uid": "loki" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "Tool Usage Rate (per 5min)", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": 
"none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "stepAfter", "lineWidth": 2, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "yellow", "value": 5 }, { "color": "red", "value": 10 } ] }, "unit": "short" }, "overrides": [ { "matcher": { "id": "byName", "options": "Bash" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "red" } }, { "id": "custom.lineWidth", "value": 3 } ] }, { "matcher": { "id": "byName", "options": "Read" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "blue" } } ] }, { "matcher": { "id": "byName", "options": "Write" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "green" } } ] }, { "matcher": { "id": "byName", "options": "Grep" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "orange" } } ] } ] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 23 }, "id": 7, "options": { "legend": { "calcs": ["lastNotNull", "mean"], "displayMode": "table", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "sum by (tool_name) (count_over_time({service_name=\"claude-code\"} |= \"claude_code.tool_result\" [5m]))", "interval": "", "legendFormat": "{{tool_name}}", "refId": "A" } ], "title": "Tool Usage Rate Over Time", "type": "timeseries" }, { "datasource": { "type": "loki", "uid": "loki" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "Total Tool Count", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 0, "gradientMode": "none", "hideFrom": { "legend": false, 
"tooltip": false, "vis": false }, "lineInterpolation": "stepAfter", "lineWidth": 2, "pointSize": 4, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null } ] }, "unit": "short" }, "overrides": [ { "matcher": { "id": "byName", "options": "Bash" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "red" } }, { "id": "custom.lineWidth", "value": 3 } ] }, { "matcher": { "id": "byName", "options": "Read" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "blue" } } ] }, { "matcher": { "id": "byName", "options": "Write" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "green" } } ] }, { "matcher": { "id": "byName", "options": "Grep" }, "properties": [ { "id": "color", "value": { "mode": "fixed", "fixedColor": "orange" } } ] } ] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 23 }, "id": 14, "options": { "legend": { "calcs": ["lastNotNull", "max"], "displayMode": "table", "placement": "bottom", "showLegend": true }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "sum by (tool_name) (count_over_time({service_name=\"claude-code\"} |= \"claude_code.tool_result\" [$__range]))", "interval": "", "legendFormat": "{{tool_name}}", "refId": "A" } ], "title": "Cumulative Tool Usage", "type": "timeseries" }, { "datasource": { "type": "loki", "uid": "loki" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "Success Rate %", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 2, "pointSize": 5, 
"scaleDistribution": { "type": "linear" }, "showPoints": "auto", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "line" } }, "mappings": [], "max": 100, "min": 0, "thresholds": { "mode": "absolute", "steps": [ { "color": "red", "value": 0 }, { "color": "yellow", "value": 80 }, { "color": "green", "value": 95 } ] }, "unit": "percent" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 23 }, "id": 8, "options": { "legend": { "calcs": ["last"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "100 * (sum by (tool_name) (count_over_time({service_name=\"claude-code\"} |= \"claude_code.tool_result\" | json | success=\"true\" [15m]))) / (sum by (tool_name) (count_over_time({service_name=\"claude-code\"} |= \"claude_code.tool_result\" [15m])))", "interval": "", "legendFormat": "{{tool_name}}", "refId": "A" } ], "title": "Tool Success Rate", "type": "timeseries" }, { "title": "⚡ Performance & Errors", "type": "row", "gridPos": {"h": 1, "w": 24, "x": 0, "y": 39}, "collapsed": false }, { "datasource": { "type": "loki", "uid": "loki" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "Milliseconds", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 2, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "none" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "ms" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 40 }, "id": 9, "options": { "legend": { 
"calcs": ["mean", "max"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "avg by (model) (avg_over_time({service_name=\"claude-code\"} |= \"claude_code.api_request\" | unwrap duration_ms [$__interval]))", "interval": "", "legendFormat": "{{model}}", "refId": "A" } ], "title": "API Request Duration by Model", "type": "timeseries" }, { "datasource": { "type": "loki", "uid": "loki" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "Errors/min", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "bars", "fillOpacity": 80, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "normal" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 40 }, "id": 10, "options": { "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "sum by (status_code) (rate({service_name=\"claude-code\"} |= \"claude_code.api_error\" | json | __error__ = \"\" [$__interval]))", "interval": "", "legendFormat": "HTTP {{status_code}}", "refId": "A" } ], "title": "API Error Rate", "type": "timeseries" }, { "title": "📝 User Activity & Productivity", "type": "row", "gridPos": {"h": 1, "w": 24, "x": 0, "y": 48}, "collapsed": false }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, 
"custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "line", "fillOpacity": 10, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 2, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "normal" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 49 }, "id": 11, "options": { "legend": { "calcs": ["sum"], "displayMode": "table", "placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum by (type) (rate(claude_code_lines_of_code_count_total{job=\"otel-collector\"}[5m]) * 60)", "interval": "", "legendFormat": "{{type}} lines/min", "refId": "A" } ], "title": "Code Changes Rate", "type": "timeseries" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "fieldConfig": { "defaults": { "color": { "mode": "palette-classic" }, "custom": { "axisLabel": "", "axisPlacement": "auto", "barAlignment": 0, "drawStyle": "bars", "fillOpacity": 80, "gradientMode": "none", "hideFrom": { "legend": false, "tooltip": false, "vis": false }, "lineInterpolation": "linear", "lineWidth": 1, "pointSize": 5, "scaleDistribution": { "type": "linear" }, "showPoints": "never", "spanNulls": false, "stacking": { "group": "A", "mode": "normal" }, "thresholdsStyle": { "mode": "off" } }, "mappings": [], "thresholds": { "mode": "absolute", "steps": [ { "color": "green", "value": null }, { "color": "red", "value": 80 } ] }, "unit": "short" }, "overrides": [] }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 49 }, "id": 12, "options": { "legend": { "calcs": ["sum"], "displayMode": "table", 
"placement": "bottom" }, "tooltip": { "mode": "multi", "sort": "desc" } }, "targets": [ { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum(increase(claude_code_commit_count_total{job=\"otel-collector\"}[1h])) or vector(0)", "interval": "", "legendFormat": "Commits", "refId": "A" }, { "datasource": { "type": "prometheus", "uid": "prometheus" }, "expr": "sum(increase(claude_code_pull_request_count_total{job=\"otel-collector\"}[1h])) or vector(0)", "interval": "", "legendFormat": "Pull Requests", "refId": "B" } ], "title": "Development Activity (Hourly)", "type": "timeseries" }, { "title": "🔍 Event Logs", "type": "row", "gridPos": {"h": 1, "w": 24, "x": 0, "y": 57}, "collapsed": false }, { "datasource": { "type": "loki", "uid": "loki" }, "gridPos": { "h": 8, "w": 12, "x": 0, "y": 58 }, "id": 13, "options": { "dedupStrategy": "none", "enableLogDetails": true, "prettifyLogMessage": false, "showCommonLabels": false, "showLabels": false, "showTime": true, "sortOrder": "Descending", "wrapLogMessage": false }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "{service_name=\"claude-code\"} |= \"claude_code.tool_result\" | line_format \"{{.event_timestamp}} [{{.tool_name}}] {{if eq .success \\\"true\\\"}}✅{{else}}❌{{end}} {{.duration_ms}}ms {{if .error}}ERROR: {{.error}}{{end}}\"", "refId": "A" } ], "title": "Tool Execution Events", "type": "logs" }, { "datasource": { "type": "loki", "uid": "loki" }, "gridPos": { "h": 8, "w": 12, "x": 12, "y": 58 }, "id": 17, "options": { "dedupStrategy": "none", "enableLogDetails": true, "prettifyLogMessage": false, "showCommonLabels": false, "showLabels": false, "showTime": true, "sortOrder": "Descending", "wrapLogMessage": false }, "targets": [ { "datasource": { "type": "loki", "uid": "loki" }, "expr": "{service_name=\"claude-code\"} |= \"claude_code.api_error\" | line_format \"{{.event_timestamp}} [{{.model}}] ❌ HTTP {{.status_code}} {{.duration_ms}}ms ERROR: {{.error}}\"", "refId": "A" } 
], "title": "API Error Events", "type": "logs" } ], "refresh": "30s", "schemaVersion": 27, "style": "dark", "tags": ["claude-code", "observability"], "templating": { "list": [] }, "time": { "from": "now-1h", "to": "now" }, "timepicker": {}, "timezone": "", "title": "Claude Code Observability", "uid": "claude-code-obs", "version": 1 }

================================================
FILE: collector-config.yaml
================================================
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  # Add resource attributes for better analysis
  resource:
    attributes:
      - key: environment
        value: "production"
        action: upsert

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  debug:
    verbosity: normal
  otlphttp:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [resource]
      exporters: [prometheus, debug]
    logs:
      receivers: [otlp]
      processors: [resource]
      exporters: [debug, otlphttp]

================================================
FILE: docker-compose-lgtm.yml
================================================
version: "3.9"
services:
  lgtm:
    image: grafana/otel-lgtm:1.4.0
    container_name: lgtm
    ports:
      - "3000:3000"   # Grafana
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    restart: unless-stopped

================================================
FILE: docker-compose.yml
================================================
networks:
  otel-network:
    driver: bridge

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest   # ≈ 50 MB
    container_name: otel-collector
    command: ["--config=/etc/otel/collector-config.yaml"]
    volumes:
      - ./collector-config.yaml:/etc/otel/collector-config.yaml:ro
    ports:
      - "4317:4317"   # OTLP gRPC in
      - "4318:4318"   # OTLP HTTP in
      - "8889:8889"   # Prometheus scrape out
    restart: unless-stopped
    networks:
      - otel-network

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    ports:
      - "9090:9090"
    restart: unless-stopped
    depends_on: [otel-collector]
    networks:
      - otel-network

  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped
    networks:
      - otel-network

  grafana:
    image: grafana/grafana-oss:latest
    container_name: grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_FEATURE_TOGGLES_ENABLE=logsSampleInExplore
    volumes:
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml:ro
      - ./grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml:ro
      - ./claude-code-dashboard.json:/var/lib/grafana/dashboards/claude-code-dashboard.json:ro
    ports:
      - "3000:3000"
    restart: unless-stopped
    depends_on: [prometheus, loki]
    networks:
      - otel-network

================================================
FILE: grafana-dashboards.yml
================================================
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards

================================================
FILE: grafana-datasources.yml
================================================
apiVersion: 1

datasources:
  - name: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

  - name: loki
    type: loki
    access: proxy
    url: http://loki:3100

  - name: alertmanager
    type: alertmanager
    access: proxy
    url: http://alertmanager:9093

================================================
FILE: prometheus.yml
================================================
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']