
Monitoring

This guide covers monitoring, observability, and alerting for vCon Server.

Health Endpoints

API Health

curl http://localhost:8000/api/health

Response:

{"status": "healthy"}

Version Information

curl http://localhost:8000/api/version

Response:

{
  "version": "2024.01.15",
  "git_commit": "a1b2c3d",
  "build_time": "2024-01-15T10:30:00Z"
}
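
In a monitoring script, these two responses can be checked programmatically. A minimal Python sketch, assuming only the response shapes shown above (the function names are illustrative, not part of vCon Server):

```python
import json

def is_healthy(health_body: str) -> bool:
    """Return True when /api/health reports {"status": "healthy"}."""
    try:
        return json.loads(health_body).get("status") == "healthy"
    except (json.JSONDecodeError, AttributeError):
        return False

def summarize_version(version_body: str) -> str:
    """One-line summary of the /api/version payload, for logs or alerts."""
    info = json.loads(version_body)
    return f"{info['version']} ({info['git_commit']})"
```

Fetch the bodies with curl or urllib and feed them to these helpers.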

Key Metrics

Queue Metrics

| Metric | Command | Description |
| --- | --- | --- |
| Ingress Depth | LLEN {ingress_list} | Items waiting to be processed |
| Egress Depth | LLEN {egress_list} | Processed items awaiting pickup |
| DLQ Depth | LLEN DLQ:{ingress_list} | Failed items |

# Check queue depths
docker compose exec redis redis-cli LLEN default
docker compose exec redis redis-cli LLEN DLQ:default

# Watch queue depth over time
watch -n 5 'docker compose exec -T redis redis-cli LLEN default'

Processing Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Processing Rate | vCons/minute | Depends on volume |
| Processing Latency | Time per vCon | < 30s typical |
| Error Rate | Failed/Total | < 5% |
| DLQ Growth | New DLQ items/hour | 0 |
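
The Error Rate target can be checked mechanically. A small illustrative helper (the 5% default mirrors the table above; the function names are ours, not part of vCon Server):

```python
def error_rate(failed: int, total: int) -> float:
    """Failed/Total as a fraction; 0.0 when nothing has been processed yet."""
    return failed / total if total else 0.0

def breaches_error_target(failed: int, total: int, target: float = 0.05) -> bool:
    """True when the observed error rate exceeds the < 5% target."""
    return error_rate(failed, total) > target
```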

Resource Metrics

| Metric | Description | Alert Threshold |
| --- | --- | --- |
| CPU Usage | Container CPU % | > 80% |
| Memory Usage | Container memory | > 80% |
| Disk Usage | Storage volume | > 80% |
| Network I/O | Bytes in/out | Baseline + 50% |

OpenTelemetry Integration

vCon Server includes OpenTelemetry instrumentation.

Enable Tracing

# Environment variables
OTEL_SERVICE_NAME=vcon-server
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp

Docker Compose with Collector

services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
    networks:
      - conserver

  conserver:
    environment:
      - OTEL_SERVICE_NAME=vcon-server
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317

Collector Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:

exporters:
  # Note: the older `logging` and dedicated `jaeger` exporters were removed
  # from recent collector releases; use `debug` and plain OTLP instead.
  debug:
    verbosity: detailed

  # Jaeger for traces (recent Jaeger releases ingest OTLP directly on 4317)
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true

  # Prometheus for metrics
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger, debug]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Langfuse Integration

Langfuse accepts traces via the standard OTLP HTTP endpoint — no separate collector needed.

Step 1 — Generate your auth header

In Langfuse, go to Settings → API Keys and copy your Public Key and Secret Key. Then run:

echo -n "pk-lf-YOUR_PUBLIC_KEY:sk-lf-YOUR_SECRET_KEY" | base64

Copy the output for the next step.
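
Equivalently, the header value can be built in Python (the key values here are placeholders):

```python
import base64

def langfuse_basic_auth(public_key: str, secret_key: str) -> str:
    """Base64-encode "public:secret" and prefix "Basic ", as the OTLP header expects."""
    token = base64.b64encode(f"{public_key}:{secret_key}".encode()).decode()
    return f"Basic {token}"
```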

Step 2 — Set environment variables

Add the following to your .env file:

OTEL_EXPORTER_OTLP_ENDPOINT=http://<your-langfuse-host>:3000/api/public/otel
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <output_from_step_1>

For Langfuse Cloud use:

OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel

Traces will appear in Langfuse under the service names conserver and api.

Sending to Multiple Backends (Fan-out via OTel Collector)

vCon Server can only point to one OTLP endpoint via environment variables. To fan out to multiple backends simultaneously, use the OTel Collector as a proxy. The example below uses SigNoz and Langfuse, but the same pattern works with any two OTLP-compatible backends.

vcon-server ──OTLP──▶ OTel Collector ──┬──▶ Backend A (gRPC :4317)
                                       └──▶ Backend B (HTTP)

Step 1 — Point vCon Server at the collector

In your .env:

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

No auth header needed here — the collector handles authentication to each backend separately.

Step 2 — Add the collector to docker-compose.yml

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
    networks:
      - conserver

Use otel/opentelemetry-collector-contrib (not the slim otel/opentelemetry-collector) — it includes the otlphttp exporter needed for Langfuse.

Step 3 — Configure the collector (otel-collector-config.yaml)

Generate your Langfuse Basic auth token first:

echo -n "pk-lf-YOUR_PUBLIC_KEY:sk-lf-YOUR_SECRET_KEY" | base64

Then create otel-collector-config.yaml:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:

exporters:
  # SigNoz — receives OTLP gRPC
  otlp/signoz:
    endpoint: <your-signoz-host>:4317
    tls:
      insecure: true

  # Langfuse — receives OTLP HTTP
  otlphttp/langfuse:
    endpoint: http://<your-langfuse-host>:3000/api/public/otel
    headers:
      Authorization: "Basic <base64_from_above>"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/signoz, otlphttp/langfuse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/signoz]

Langfuse only ingests traces (LLM spans). Metrics go to SigNoz only.

Traces will appear in both backends. In Langfuse, look under service names conserver and api.


Prometheus Integration

Metrics Endpoint

If using OpenTelemetry with Prometheus exporter:

curl http://localhost:8889/metrics

Prometheus Configuration

# prometheus.yml
scrape_configs:
  - job_name: 'vcon-server'
    static_configs:
      - targets: ['otel-collector:8889']
    
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

Key Prometheus Queries

# Processing rate (vCons per minute)
rate(vcon_processed_total[5m]) * 60

# Error rate
rate(vcon_processing_errors_total[5m]) / rate(vcon_processed_total[5m])

# Average processing duration
histogram_quantile(0.95, rate(vcon_processing_duration_seconds_bucket[5m]))

# Queue depth
redis_list_length{list="default"}
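
As a back-of-envelope check, rate() over a counter is roughly the counter's increase divided by the window, and the first query above multiplies by 60 to get vCons per minute. A simplified sketch of that arithmetic (PromQL's rate() also handles counter resets and extrapolation, which this ignores):

```python
def per_minute_rate(count_start: float, count_end: float, window_seconds: float) -> float:
    """Approximate rate(counter[window]) * 60: counter increase per minute."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return (count_end - count_start) / window_seconds * 60
```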

Datadog Integration

Agent Configuration

# docker-compose.yml
services:
  datadog-agent:
    image: datadog/agent:latest
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_APM_ENABLED=true
      - DD_LOGS_ENABLED=true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    networks:
      - conserver

Application Configuration

# Enable Datadog APM
DD_AGENT_HOST=datadog-agent
DD_TRACE_ENABLED=true
DD_PROFILING_ENABLED=true

Datadog Dashboards

Create dashboards for:

  1. Overview: Health, version, uptime
  2. Processing: Queue depths, throughput, latency
  3. Errors: DLQ depth, error rates, error types
  4. Resources: CPU, memory, network, disk

Grafana Dashboards

Sample Dashboard JSON

{
  "dashboard": {
    "title": "vCon Server",
    "panels": [
      {
        "title": "Queue Depth",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_list_length{list=~\"default|production\"}",
            "legendFormat": "{{list}}"
          }
        ]
      },
      {
        "title": "Processing Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(vcon_processed_total[5m]) * 60"
          }
        ]
      },
      {
        "title": "DLQ Items",
        "type": "stat",
        "targets": [
          {
            "expr": "redis_list_length{list=~\"DLQ:.*\"}"
          }
        ]
      }
    ]
  }
}

Alerting

Alert Rules

# prometheus-alerts.yml
groups:
  - name: vcon-server
    rules:
      - alert: HighQueueDepth
        expr: redis_list_length{list="default"} > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth is high"
          description: "Queue {{ $labels.list }} has {{ $value }} items"

      - alert: DLQNotEmpty
        expr: redis_list_length{list=~"DLQ:.*"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DLQ has items"
          description: "DLQ {{ $labels.list }} has {{ $value }} items"

      - alert: HighErrorRate
        expr: rate(vcon_processing_errors_total[5m]) / rate(vcon_processed_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ServiceDown
        expr: up{job="vcon-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "vCon Server is down"
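
The for: 5m clause means the condition must hold continuously before an alert fires, which filters out brief spikes. A toy illustration of that behavior over scrape samples (not Prometheus code; the evaluation logic is deliberately simplified):

```python
def alert_firing(samples: list[float], threshold: float, for_samples: int) -> bool:
    """Fire only when the last `for_samples` scrapes all exceed the threshold."""
    recent = samples[-for_samples:]
    return len(recent) == for_samples and all(v > threshold for v in recent)
```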

PagerDuty Integration

# alertmanager.yml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_SERVICE_KEY}
        severity: '{{ .Labels.severity }}'

route:
  receiver: 'pagerduty'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

Custom Monitoring Script

#!/bin/bash
# monitor.sh - Custom monitoring script

API_URL="${API_URL:-http://localhost:8000/api}"
TOKEN="${CONSERVER_API_TOKEN}"
THRESHOLD_QUEUE=100
THRESHOLD_DLQ=0

# Check API health (adjust the header name to your deployment's auth scheme)
health=$(curl -s -H "x-conserver-api-token: $TOKEN" "$API_URL/health" | jq -r .status)
if [ "$health" != "healthy" ]; then
    echo "CRITICAL: API health check failed"
    exit 2
fi

# Check queue depth
queue_depth=$(docker compose exec -T redis redis-cli LLEN default)
if [ "$queue_depth" -gt "$THRESHOLD_QUEUE" ]; then
    echo "WARNING: Queue depth is $queue_depth (threshold: $THRESHOLD_QUEUE)"
    exit 1
fi

# Check DLQ
dlq_depth=$(docker compose exec -T redis redis-cli LLEN DLQ:default)
if [ "$dlq_depth" -gt "$THRESHOLD_DLQ" ]; then
    echo "WARNING: DLQ has $dlq_depth items"
    exit 1
fi

echo "OK: All checks passed"
exit 0

Logging Integration

Ship Logs to ELK

# docker-compose.yml
services:
  conserver:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
        tag: vcon.server

Ship Logs to CloudWatch

services:
  conserver:
    logging:
      driver: "awslogs"
      options:
        awslogs-region: us-east-1
        awslogs-group: vcon-server
        awslogs-stream: "{{.Name}}"

Best Practices

1. Monitor All Layers

  • Application (processing rate, errors)
  • Infrastructure (CPU, memory, disk)
  • Dependencies (Redis, databases)
  • External services (Deepgram, OpenAI)

2. Set Meaningful Alerts

  • Alert on symptoms, not causes
  • Use appropriate thresholds
  • Include runbook links in alerts
  • Avoid alert fatigue

3. Visualize Trends

  • Historical processing rates
  • Queue depth over time
  • Error patterns
  • Resource usage trends

4. Regular Review

  • Weekly review of dashboards
  • Monthly alert threshold review
  • Quarterly capacity planning