This guide covers monitoring, observability, and alerting for vCon Server.
```bash
curl http://localhost:8000/api/health
```

Response:

```json
{"status": "healthy"}
```

```bash
curl http://localhost:8000/api/version
```

Response:

```json
{
  "version": "2024.01.15",
  "git_commit": "a1b2c3d",
  "build_time": "2024-01-15T10:30:00Z"
}
```

| Metric | Command | Description |
|---|---|---|
| Ingress Depth | `LLEN {ingress_list}` | Items waiting to process |
| Egress Depth | `LLEN {egress_list}` | Processed items awaiting pickup |
| DLQ Depth | `LLEN DLQ:{ingress_list}` | Failed items |
```bash
# Check queue depths
docker compose exec redis redis-cli LLEN default
docker compose exec redis redis-cli LLEN DLQ:default

# Watch queue depth over time
watch -n 5 'docker compose exec -T redis redis-cli LLEN default'
```

| Metric | Description | Target |
|---|---|---|
| Processing Rate | vCons/minute | Depends on volume |
| Processing Latency | Time per vCon | < 30s typical |
| Error Rate | Failed/Total | < 5% |
| DLQ Growth | New DLQ items/hour | 0 |
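The processing rate above can be roughed out from two queue-depth samples. A minimal sketch with made-up numbers (the `rate_per_min` helper is illustrative, not part of vCon Server, and it assumes the ingress list only drains through processing — a queue that is also refilling will understate the rate):

```shell
# Rough vCons/minute from two LLEN samples taken `interval` seconds apart
rate_per_min() {
  local before=$1 after=$2 interval=$3
  echo $(( (before - after) * 60 / interval ))
}

# Example: depth fell from 500 to 440 over 30 seconds
rate_per_min 500 440 30   # → 120
```

For a real measurement, substitute two `redis-cli LLEN default` readings for the sample values.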

| Metric | Description | Alert Threshold |
|---|---|---|
| CPU Usage | Container CPU % | > 80% |
| Memory Usage | Container memory | > 80% |
| Disk Usage | Storage volume | > 80% |
| Network I/O | Bytes in/out | Baseline + 50% |
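These thresholds are easy to wire into a script. A tiny illustration with static sample values (the `check` helper is hypothetical; in practice the percentage would come from `docker stats` or your metrics backend):

```shell
# Compare a usage percentage against the 80% alert threshold
check() {
  if [ "$2" -gt 80 ]; then
    echo "ALERT: $1 at $2%"
  else
    echo "OK: $1 at $2%"
  fi
}

check cpu 85      # → ALERT: cpu at 85%
check memory 62   # → OK: memory at 62%
```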
vCon Server includes OpenTelemetry instrumentation.
```bash
# Environment variables
OTEL_SERVICE_NAME=vcon-server
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_EXPORTER=otlp
OTEL_METRICS_EXPORTER=otlp
```

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
    networks:
      - conserver
  conserver:
    environment:
      - OTEL_SERVICE_NAME=vcon-server
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```

Then create `otel-collector-config.yaml`:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
exporters:
  logging:
    loglevel: debug
  # Jaeger for traces
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true
  # Prometheus for metrics
  prometheus:
    endpoint: "0.0.0.0:8889"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```

Langfuse accepts traces via the standard OTLP HTTP endpoint — no separate collector needed.
In Langfuse, go to Settings → API Keys and copy your Public Key and Secret Key. Then run:
```bash
echo -n "pk-lf-YOUR_PUBLIC_KEY:sk-lf-YOUR_SECRET_KEY" | base64
```

Copy the output for the next step.

Add the following to your `.env` file:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://<your-langfuse-host>:3000/api/public/otel
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf
OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic <output_from_step_1>
```

For Langfuse Cloud use:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=https://cloud.langfuse.com/api/public/otel
```

Traces will appear in Langfuse under the service names `conserver` and `api`.
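The whole header line can also be assembled in one step. A sketch with placeholder keys (note `printf '%s'` instead of `echo -n`, which guarantees no trailing newline sneaks into the token on shells where `echo -n` is not portable):

```shell
# Build the full OTEL_EXPORTER_OTLP_HEADERS line (placeholder keys)
PUBLIC_KEY="pk-lf-YOUR_PUBLIC_KEY"
SECRET_KEY="sk-lf-YOUR_SECRET_KEY"
TOKEN=$(printf '%s' "${PUBLIC_KEY}:${SECRET_KEY}" | base64)
echo "OTEL_EXPORTER_OTLP_HEADERS=Authorization=Basic ${TOKEN}"
```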
vCon Server can only point to one OTLP endpoint via environment variables. To fan out to multiple backends simultaneously, use the OTel Collector as a proxy. The example below uses SigNoz and Langfuse, but the same pattern works with any two OTLP-compatible backends.
```
vcon-server ──OTLP──▶ OTel Collector ──▶ Backend A (gRPC :4317)
                                     └──▶ Backend B (HTTP)
```

In your `.env`:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

No auth header is needed here; the collector handles authentication to each backend separately.
```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
    networks:
      - conserver
```

Use `otel/opentelemetry-collector-contrib` (not the slim `otel/opentelemetry-collector`) — it includes the `otlphttp` exporter needed for Langfuse.
Generate your Langfuse Basic auth token first:
```bash
echo -n "pk-lf-YOUR_PUBLIC_KEY:sk-lf-YOUR_SECRET_KEY" | base64
```

Then create `otel-collector-config.yaml`:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
exporters:
  # SigNoz — receives OTLP gRPC
  otlp/signoz:
    endpoint: <your-signoz-host>:4317
    tls:
      insecure: true
  # Langfuse — receives OTLP HTTP
  otlphttp/langfuse:
    endpoint: http://<your-langfuse-host>:3000/api/public/otel
    headers:
      Authorization: "Basic <base64_from_above>"
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/signoz, otlphttp/langfuse]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/signoz]
```

Langfuse only ingests traces (LLM spans); metrics go to SigNoz only.
Traces will appear in both backends. In Langfuse, look under service names conserver and api.
If using OpenTelemetry with Prometheus exporter:
```bash
curl http://localhost:8889/metrics
```

```yaml
# prometheus.yml
scrape_configs:
  - job_name: 'vcon-server'
    static_configs:
      - targets: ['otel-collector:8889']
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
```

```promql
# Processing rate (vCons per minute)
rate(vcon_processed_total[5m]) * 60

# Error rate
rate(vcon_processing_errors_total[5m]) / rate(vcon_processed_total[5m])

# 95th percentile processing duration
histogram_quantile(0.95, rate(vcon_processing_duration_seconds_bucket[5m]))

# Queue depth
redis_list_length{list="default"}
```
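As a sanity check on the error-rate expression, the same ratio can be computed by hand from two snapshots of each counter (the numbers here are made up; awk supplies the floating-point division that integer shell arithmetic lacks):

```shell
# Error rate over a window, from two snapshots of each counter
errors_t0=120; errors_t1=132   # vcon_processing_errors_total
total_t0=4700; total_t1=5180   # vcon_processed_total

awk -v de=$((errors_t1 - errors_t0)) -v dt=$((total_t1 - total_t0)) \
  'BEGIN { printf "error rate: %.1f%%\n", 100 * de / dt }'   # → error rate: 2.5%
```

That is comfortably under the 5% threshold used in the alert rules below.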
```yaml
# docker-compose.yml
services:
  datadog-agent:
    image: datadog/agent:latest
    environment:
      - DD_API_KEY=${DD_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_APM_ENABLED=true
      - DD_LOGS_ENABLED=true
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /proc/:/host/proc/:ro
      - /sys/fs/cgroup/:/host/sys/fs/cgroup:ro
    networks:
      - conserver
```

```bash
# Enable Datadog APM
DD_AGENT_HOST=datadog-agent
DD_TRACE_ENABLED=true
DD_PROFILING_ENABLED=true
```

Create dashboards for:
- Overview: Health, version, uptime
- Processing: Queue depths, throughput, latency
- Errors: DLQ depth, error rates, error types
- Resources: CPU, memory, network, disk
```json
{
  "dashboard": {
    "title": "vCon Server",
    "panels": [
      {
        "title": "Queue Depth",
        "type": "graph",
        "targets": [
          {
            "expr": "redis_list_length{list=~\"default|production\"}",
            "legendFormat": "{{list}}"
          }
        ]
      },
      {
        "title": "Processing Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(vcon_processed_total[5m]) * 60"
          }
        ]
      },
      {
        "title": "DLQ Items",
        "type": "stat",
        "targets": [
          {
            "expr": "redis_list_length{list=~\"DLQ:.*\"}"
          }
        ]
      }
    ]
  }
}
```

Example alert rules for `prometheus-alerts.yml`:
```yaml
groups:
  - name: vcon-server
    rules:
      - alert: HighQueueDepth
        expr: redis_list_length{list="default"} > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue depth is high"
          description: "Queue {{ $labels.list }} has {{ $value }} items"
      - alert: DLQNotEmpty
        expr: redis_list_length{list=~"DLQ:.*"} > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "DLQ has items"
          description: "DLQ {{ $labels.list }} has {{ $value }} items"
      - alert: HighErrorRate
        expr: rate(vcon_processing_errors_total[5m]) / rate(vcon_processed_total[5m]) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: ServiceDown
        expr: up{job="vcon-server"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "vCon Server is down"
```

Route alerts with `alertmanager.yml`:
```yaml
receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ${PAGERDUTY_SERVICE_KEY}
        severity: '{{ .Labels.severity }}'
route:
  receiver: 'pagerduty'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
```

```bash
#!/bin/bash
# monitor.sh - Custom monitoring script

API_URL="${API_URL:-http://localhost:8000/api}"
TOKEN="${CONSERVER_API_TOKEN}"
THRESHOLD_QUEUE=100
THRESHOLD_DLQ=0

# Check API health
health=$(curl -s "$API_URL/health" | jq -r .status)
if [ "$health" != "healthy" ]; then
  echo "CRITICAL: API health check failed"
  exit 2
fi

# Check queue depth
queue_depth=$(docker compose exec -T redis redis-cli LLEN default)
if [ "$queue_depth" -gt "$THRESHOLD_QUEUE" ]; then
  echo "WARNING: Queue depth is $queue_depth (threshold: $THRESHOLD_QUEUE)"
  exit 1
fi

# Check DLQ
dlq_depth=$(docker compose exec -T redis redis-cli LLEN DLQ:default)
if [ "$dlq_depth" -gt "$THRESHOLD_DLQ" ]; then
  echo "WARNING: DLQ has $dlq_depth items"
  exit 1
fi

echo "OK: All checks passed"
exit 0
```
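One way to schedule the script (an illustrative crontab entry; the install path and log location are assumptions for your environment):

```shell
# m h dom mon dow  command
*/5 * * * * /opt/vcon-server/monitor.sh >> /var/log/vcon-monitor.log 2>&1
```

Install with `crontab -e`; the script's non-zero exit codes (1 for warnings, 2 for critical) make it equally usable as a Nagios-style check.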
Ship container logs with a Docker logging driver. Fluentd:

```yaml
# docker-compose.yml
services:
  conserver:
    logging:
      driver: "fluentd"
      options:
        fluentd-address: localhost:24224
        tag: vcon.server
```

AWS CloudWatch Logs:

```yaml
services:
  conserver:
    logging:
      driver: "awslogs"
      options:
        awslogs-region: us-east-1
        awslogs-group: vcon-server
        awslogs-stream: "{{.Name}}"
```

Monitor every layer:

- Application (processing rate, errors)
- Infrastructure (CPU, memory, disk)
- Dependencies (Redis, databases)
- External services (Deepgram, OpenAI)

Alerting best practices:

- Alert on symptoms, not causes
- Use appropriate thresholds
- Include runbook links in alerts
- Avoid alert fatigue

Track trends over time:

- Historical processing rates
- Queue depth over time
- Error patterns
- Resource usage trends

Review cadence:

- Weekly review of dashboards
- Monthly alert threshold review
- Quarterly capacity planning