# DevOps Agent Telemetry Configuration

This document explains how to configure and manage the DevOps Agent's telemetry system, including solutions for Grafana Cloud rate limiting.
## Overview

The DevOps Agent includes comprehensive telemetry capabilities:

- **OpenLIT Integration**: Automatic LLM observability
- **Custom OpenTelemetry Metrics**: Agent-specific performance tracking
- **Grafana Cloud Export**: Production-ready metrics export
- **Local Development Tools**: Rich dashboard for development
## Environment Variables

### Core Configuration

#### `GRAFANA_OTLP_ENDPOINT`

- **Purpose**: Grafana Cloud OTLP endpoint URL
- **Example**: `https://otlp-gateway-prod-us-central-0.grafana.net/otlp`
- **Required**: Yes, for Grafana Cloud export

#### `GRAFANA_OTLP_TOKEN`

- **Purpose**: Grafana Cloud authentication token
- **Format**: Base64-encoded `instanceID:token`
- **Required**: Yes, for Grafana Cloud export
### OpenLIT Configuration

#### `OPENLIT_ENVIRONMENT`

- **Default**: `Production`
- **Purpose**: Environment name for OpenLIT metrics
- **Values**: `Production`, `Development`, `Staging`, etc.

#### `OPENLIT_COLLECT_GPU_STATS`

- **Default**: `false`
- **Purpose**: Enable GPU monitoring if a GPU is available
- **Values**: `true`, `false`, `1`, `0`, `yes`, `no`
- **Note**: Requires a GPU and the `nvidia-ml-py` package. Disabled by default to avoid warnings on non-GPU systems.

#### `OPENLIT_DISABLE_METRICS`

- **Default**: `false`
- **Purpose**: Completely disable OpenLIT metrics collection
- **Values**: `true`, `false`, `1`, `0`, `yes`, `no`
- **Use Case**: When you only want custom agent metrics
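The boolean flags in this document all accept the same truthy/falsy spellings (`true`/`false`, `1`/`0`, `yes`/`no`). A minimal sketch of how such a flag might be parsed; the helper name `env_flag` is hypothetical, not part of the agent's API:

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret true/false, 1/0, yes/no (case-insensitive) as a boolean."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("true", "1", "yes")
```

Unset variables fall back to the documented default, so e.g. `env_flag("OPENLIT_DISABLE_METRICS")` returns `False` when nothing is exported.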
### Rate Limiting Controls

#### `GRAFANA_EXPORT_INTERVAL_SECONDS`

- **Default**: `120` (2 minutes)
- **Purpose**: How often to export metrics to Grafana Cloud
- **Recommendation**: Increase if hitting rate limits
- **Example**: `export GRAFANA_EXPORT_INTERVAL_SECONDS=300` (5 minutes)

#### `GRAFANA_EXPORT_TIMEOUT_SECONDS`

- **Default**: `30`
- **Purpose**: Timeout for export requests
- **Range**: 10-60 seconds

#### `DEVOPS_AGENT_DISABLE_TELEMETRY_EXPORT`

- **Default**: `false`
- **Purpose**: Completely disable telemetry export (local metrics only)
- **Values**: `true`, `false`, `1`, `0`, `yes`, `no`
- **Use Case**: Development, testing, or when hitting rate limits
### Tracing Configuration

#### `OPENLIT_CAPTURE_CONTENT`

- **Default**: `true`
- **Purpose**: Capture LLM prompts and completions in traces
- **Values**: `true`, `false`, `1`, `0`, `yes`, `no`
- **Privacy**: Set to `false` for sensitive data environments

#### `OPENLIT_DISABLE_BATCH`

- **Default**: `false`
- **Purpose**: Disable batch processing of traces (useful for local development)
- **Values**: `true`, `false`, `1`, `0`, `yes`, `no`
- **Use Case**: Local debugging when you want immediate trace export

#### `OPENLIT_DISABLED_INSTRUMENTORS`

- **Default**: `` (empty)
- **Purpose**: Disable specific auto-instrumentation
- **Format**: Comma-separated list
- **Example**: `anthropic,langchain` to disable those instrumentors
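A comma-separated list like this is typically split into individual names, tolerating stray whitespace. A sketch of that parsing (the function is illustrative; the agent's actual handling may differ):

```python
import os


def disabled_instrumentors() -> list[str]:
    """Split OPENLIT_DISABLED_INSTRUMENTORS into a clean list of names."""
    raw = os.environ.get("OPENLIT_DISABLED_INSTRUMENTORS", "")
    return [name.strip() for name in raw.split(",") if name.strip()]
```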
#### `TRACE_SAMPLING_RATE`

- **Default**: `1.0`
- **Purpose**: Control what fraction of operations to trace
- **Range**: `0.0` to `1.0`
- **Example**: `0.1` for 10% sampling in high-traffic environments
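Conceptually, a sampling rate is a per-operation coin flip: keep the trace with probability `TRACE_SAMPLING_RATE`. A stdlib-only sketch of that decision (real deployments would use an OpenTelemetry sampler such as `TraceIdRatioBased`; the function below is illustrative only):

```python
import os
import random


def should_trace() -> bool:
    """Decide whether to record a trace, based on TRACE_SAMPLING_RATE."""
    rate = float(os.environ.get("TRACE_SAMPLING_RATE", "1.0"))
    return random.random() < rate  # random() is in [0.0, 1.0)
```

At `1.0` every operation is traced; at `0.0` none are; values in between keep roughly that fraction.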
#### `SERVICE_INSTANCE_ID`

- **Default**: `devops-agent-{pid}`
- **Purpose**: Unique identifier for this agent instance
- **Use Case**: Distinguish between multiple agent instances

#### `SERVICE_VERSION`

- **Default**: `1.0.0`
- **Purpose**: Version identifier for traces
- **Use Case**: Track performance across different agent versions
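The `devops-agent-{pid}` default suggests the instance ID falls back to the process ID when the variable is unset. A sketch of that fallback (the exact construction is an assumption about the implementation):

```python
import os


def service_instance_id() -> str:
    """Use SERVICE_INSTANCE_ID if set, else derive one from the process ID."""
    return os.environ.get("SERVICE_INSTANCE_ID", f"devops-agent-{os.getpid()}")
```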
## Rate Limiting Solutions

### Problem: Grafana Cloud 429 Errors

If you see errors like:

```
Failed to export batch code: 429, reason: the request has been rejected because the tenant exceeded the request rate limit
```

### Solution 1: Increase the Export Interval

```bash
# Export every 5 minutes instead of 2 minutes
export GRAFANA_EXPORT_INTERVAL_SECONDS=300
./run.sh
```

### Solution 2: Disable Export for Development

```bash
# Disable Grafana Cloud export entirely
export DEVOPS_AGENT_DISABLE_TELEMETRY_EXPORT=true
./run.sh
```

### Solution 3: Remove Credentials Temporarily

```bash
# Unset Grafana Cloud credentials
unset GRAFANA_OTLP_ENDPOINT
unset GRAFANA_OTLP_TOKEN
./run.sh
```
## Metric Types

The agent exports these metric types to Grafana Cloud:

### OpenLIT Auto-Instrumentation Metrics

**LLM/GenAI Metrics:**

- `gen_ai.total.requests`: Number of LLM requests
- `gen_ai.usage.input_tokens`: Input tokens processed
- `gen_ai.usage.output_tokens`: Output tokens processed
- `gen_ai.usage.total_tokens`: Total tokens processed
- `gen_ai.usage.cost`: Cost distribution of LLM requests

**VectorDB Metrics:**

- `db.total.requests`: Number of VectorDB requests (ChromaDB)

**GPU Metrics (if enabled):**

- `gpu.utilization`: GPU utilization percentage
- `gpu.memory.used/available/total/free`: GPU memory metrics
- `gpu.temperature`: GPU temperature in Celsius
- `gpu.power.draw/limit`: GPU power metrics
- `gpu.fan_speed`: GPU fan speed
### Custom Agent Metrics

**Counters:**

- `devops_agent_operations_total`: Total operations by type and status
- `devops_agent_errors_total`: Total errors by operation and error type
- `devops_agent_tokens_total`: Total tokens consumed by model and type
- `devops_agent_tool_usage_total`: Total tool executions by tool type
- `devops_agent_context_operations_total`: Total context management operations

**Histograms:**

- `devops_agent_operation_duration_seconds`: Operation execution times
- `devops_agent_llm_response_time_seconds`: LLM response times by model
- `devops_agent_context_size_tokens`: Context sizes in tokens
- `devops_agent_tool_execution_seconds`: Tool execution times
- `devops_agent_file_operation_bytes`: File operation sizes

**Gauges:**

- `devops_agent_active_tools`: Currently active tool executions
- `devops_agent_context_cache_items`: Number of items in the context cache
- `devops_agent_memory_usage_mb`: Current memory usage
- `devops_agent_cpu_usage_percent`: Current CPU usage
- `devops_agent_disk_usage_mb`: Current disk usage
- `devops_agent_avg_response_time`: Rolling average response time
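Most of these gauges are point-in-time readings, but `devops_agent_avg_response_time` implies a rolling window. A minimal sketch of how such a rolling average could be maintained (illustrative only, not the agent's actual implementation):

```python
from collections import deque


class RollingAverage:
    """Average over the most recent `window` samples."""

    def __init__(self, window: int = 100):
        # deque with maxlen drops the oldest sample automatically
        self._samples = deque(maxlen=window)

    def record(self, value: float) -> None:
        self._samples.append(value)

    @property
    def value(self) -> float:
        return sum(self._samples) / len(self._samples) if self._samples else 0.0
```

A gauge callback would then simply report `avg.value` at each export interval.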
## Tracing Capabilities

The agent provides comprehensive distributed tracing through OpenLIT and custom instrumentation.

### OpenLIT Auto-Instrumentation Traces

**LLM Request Traces:**

- Complete request/response lifecycle
- Automatic span creation for each LLM call
- Token usage and cost tracking per request
- Model performance metrics
- Error context and exception details

**Trace Attributes (Semantic Conventions):**

- `gen_ai.system`: LLM provider (google, openai, anthropic)
- `gen_ai.request.model`: Model name (gemini-1.5-flash)
- `gen_ai.operation.name`: Operation type (chat, embedding)
- `gen_ai.request.temperature`: Model temperature
- `gen_ai.usage.input_tokens`: Prompt tokens
- `gen_ai.usage.output_tokens`: Completion tokens
- `gen_ai.usage.cost`: Request cost in USD

**VectorDB Traces:**

- ChromaDB operations (query, insert, update)
- Collection and index operations
- Query performance and result counts
### Custom Agent Traces

**Agent Lifecycle Traces:**

- User request processing
- Planning and execution phases
- Context management operations
- Tool orchestration

**Tool Execution Traces:**

- Individual tool performance
- Input/output size tracking
- Success/failure rates
- Error context and recovery
**Manual Tracing Examples:**

```python
# OpenLIT decorator tracing
@openlit.trace
def complex_operation():
    return process_data()

# OpenLIT context manager tracing
with openlit.start_trace("multi_step_process") as trace:
    result = step1()
    trace.set_metadata({"step1_result": len(result)})
    final = step2(result)
    trace.set_result(f"Processed {len(final)} items")

# Custom agent tracing
with trace_tool_execution("shell_command", command=cmd) as trace:
    result = execute_command(cmd)
    trace.set_metadata({
        "exit_code": result.exit_code,
        "output_size": len(result.stdout),
    })
```
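`trace_tool_execution` is the agent's custom helper; its internals aren't shown in this document, but a context manager with that shape could look roughly like the following hypothetical sketch, which records duration and metadata into a plain dict instead of a real span:

```python
import time
from contextlib import contextmanager


class _ToolTrace:
    """Stand-in for a span: collects metadata set during the tool run."""

    def __init__(self, tool_name: str, **attrs):
        self.data = {"tool": tool_name, "attributes": attrs, "metadata": {}}

    def set_metadata(self, metadata: dict) -> None:
        self.data["metadata"].update(metadata)


@contextmanager
def trace_tool_execution(tool_name: str, **attrs):
    trace = _ToolTrace(tool_name, **attrs)
    start = time.perf_counter()
    try:
        yield trace
    finally:
        # Duration is recorded even if the tool raised an exception
        trace.data["duration_seconds"] = time.perf_counter() - start
```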
### Trace Export and Analysis

**Export Destinations:**

- Grafana Cloud (production monitoring)
- Jaeger (distributed trace visualization)
- Zipkin (trace analysis)
- Local development (debugging)

**Analysis Capabilities:**

- End-to-end request flow visualization
- Performance bottleneck identification
- Error root cause analysis
- Cost optimization insights
- Capacity planning data
## Local Development

For local development without Grafana Cloud:

```bash
# Run the telemetry dashboard
uvx --with "rich>=13.0.0" --with "psutil>=5.9.0" python scripts/telemetry_dashboard.py

# Check the telemetry configuration
uv run python scripts/telemetry_check.py
```
## Production Deployment

### Recommended Settings

```bash
# Production environment variables
export GRAFANA_OTLP_ENDPOINT="your-grafana-endpoint"
export GRAFANA_OTLP_TOKEN="your-base64-token"
export GRAFANA_EXPORT_INTERVAL_SECONDS=300  # 5 minutes
export GRAFANA_EXPORT_TIMEOUT_SECONDS=30
export DEVOPS_AGENT_INTERACTIVE=false       # Full logging
```

### Rate Limit Monitoring

Monitor your Grafana Cloud usage:

- Check your Grafana Cloud metrics usage dashboard
- Monitor for 429 errors in agent logs
- Adjust export intervals based on usage patterns
## Troubleshooting

### High Rate Limit Usage

**Symptoms**: 429 errors, export failures

**Solutions**:

- Increase `GRAFANA_EXPORT_INTERVAL_SECONDS` to 300-600 seconds
- Temporarily disable export with `DEVOPS_AGENT_DISABLE_TELEMETRY_EXPORT=true`
- Contact Grafana support to increase rate limits

### Missing Metrics

**Symptoms**: No data in Grafana Cloud

**Check**:

- Verify `GRAFANA_OTLP_ENDPOINT` and `GRAFANA_OTLP_TOKEN` are set
- Check agent logs for export errors
- Verify network connectivity to Grafana Cloud

### Local Development Issues

**Symptoms**: Dashboard not working

**Solutions**:

- Install dependencies: `pip install rich psutil`
- Run from the project root directory
- Check that the telemetry module is importable
## Best Practices

- **Development**: Use `DEVOPS_AGENT_DISABLE_TELEMETRY_EXPORT=true`
- **Testing**: Set longer export intervals (300+ seconds)
- **Production**: Monitor rate limit usage and adjust intervals
- **CI/CD**: Disable telemetry export in automated pipelines
- **Debugging**: Use the local telemetry dashboard for immediate feedback
## Integration Examples

### Docker Deployment

```dockerfile
ENV GRAFANA_OTLP_ENDPOINT=https://otlp-gateway-prod-us-central-0.grafana.net/otlp
ENV GRAFANA_OTLP_TOKEN=your-token-here
ENV GRAFANA_EXPORT_INTERVAL_SECONDS=300
ENV DEVOPS_AGENT_INTERACTIVE=false
```

### Kubernetes Deployment

```yaml
env:
  - name: GRAFANA_OTLP_ENDPOINT
    valueFrom:
      secretKeyRef:
        name: grafana-credentials
        key: endpoint
  - name: GRAFANA_OTLP_TOKEN
    valueFrom:
      secretKeyRef:
        name: grafana-credentials
        key: token
  - name: GRAFANA_EXPORT_INTERVAL_SECONDS
    value: "300"
```