Automatically detects errors from any Lambda in your AWS account, retrieves relevant source code from a vector knowledge base, and posts an AI-generated root cause analysis to Slack — with no per-project setup required.
The system has two independent pipelines that work together.
Pipeline 1 — Error detection & analysis
A Lambda throws an error → CloudWatch Logs captures it → a subscription filter forwards the log batch to a forwarder Lambda → the forwarder enqueues a synthetic alarm payload → the analyser picks it up, deduplicates it, and triggers a Step Functions workflow that fetches logs, retrieves relevant code from the knowledge base, reranks the results, invokes Claude to generate a root cause analysis, and posts the result to Slack.
Pipeline 2 — Code indexing
A developer pushes to GitHub → GitHub fires a webhook → the indexer Lambda verifies the payload, fetches changed files via the GitHub API, hashes them, chunks them, uploads them to S3, and triggers a Bedrock Knowledge Base ingestion job so the new code becomes searchable once ingestion completes (typically within a few minutes).
Two background jobs run daily to keep everything wired up automatically: one registers webhooks on new GitHub repos, the other adds subscription filters to new Lambda log groups.
┌─────────────────────────────────────────────────────────────────────────────┐
│ ERROR DETECTION PIPELINE │
│ │
│ Any Lambda ──► CloudWatch Logs ──► Subscription Filter │
│ │ │
│ ▼ │
│ Log Forwarder Lambda │
│ │ │
│ ┌────────────────────┘ │
│ │ │
│ ▼ │
│ SQS Ingress Queue (DLQ after 3 retries) │
│ │ │
│ ▼ │
│ Analyser Lambda │
│ (dedup check ──► skip if seen in last 30 min) │
│ │ │
│ ▼ │
│ Step Functions Express Workflow │
│ │ │
│ ┌──────────────────┼──────────────────┐ │
│ ▼ ▼ ▼ │
│ fetchLogs retrieveCode (parallel) │
│ (CW Insights) (Bedrock KB search) │
│ │ │ │
│ └──────────────────┘ │
│ │ │
│ ▼ │
│ rerank │
│ (Bedrock Rerank API) │
│ │ │
│ ▼ │
│ chooseModel (by alarm severity) │
│ CRITICAL/ERROR → Claude Opus │
│ WARNING/INFO → Claude Haiku │
│ │ │
│ ▼ │
│ analyse │
│ (Claude via Bedrock) │
│ │ │
│ ▼ │
│ notify │
│ (Slack Incoming Webhook) │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ CODE INDEXING PIPELINE │
│ │
│ GitHub push ──► API Gateway ──► Indexer Lambda │
│ │ │
│ verify HMAC-SHA256 │
│ │ │
│ fetch changed files │
│ (GitHub API + PAT) │
│ │ │
│ hash file contents │
│ (SHA-256 in memory) │
│ │ │
│ diff against DynamoDB │
│ (skip unchanged files) │
│ │ │
│ chunk changed files │
│ (function/class boundaries) │
│ │ │
│ upload chunks to S3 │
│ chunks/{owner/repo}/{file}.json │
│ │ │
│ delete S3 objects │
│ (for removed files) │
│ │ │
│ start Bedrock KB │
│ ingestion job │
│ │ │
│ persist hashes to DynamoDB │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ BACKGROUND JOBS (daily) │
│ │
│ EventBridge ──► Webhook Registrar Lambda │
│ lists all GitHub repos owned by the user │
│ adds indexer webhook to any repo missing it │
│ │
│ EventBridge ──► Subscription Registrar Lambda │
│ lists all /aws/lambda/* log groups │
│ adds error subscription filter to any missing it │
│ excludes cloud-error-* to prevent feedback loops │
└─────────────────────────────────────────────────────────────────────────────┘
CloudWatch Logs subscription filters are attached to every /aws/lambda/* log group in the account. When a Lambda produces a log line matching the error pattern (ERROR, Exception, Traceback, CRITICAL, Fatal, panic), CloudWatch delivers a gzipped, base64-encoded batch to the Log Forwarder Lambda.
The forwarder:
- Decompresses and parses the log batch
- Scans each log event for error patterns
- Takes the first matching line as the error message (truncated to 500 chars)
- Builds a synthetic AlarmPayload shaped like a CloudWatch alarm, using the Lambda function name from the log group
- Enqueues it to the SQS ingress queue
This means any Lambda in the same AWS account is automatically covered with zero per-project configuration.
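The forwarder steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual handler: the `SyntheticAlarm` shape, the function names, and the `encodeBatch` helper (included only so test events can be crafted locally) are assumptions. The payload format (base64-encoded, gzip-compressed JSON with `logGroup` and `logEvents`) is the standard CloudWatch Logs subscription format.

```typescript
import { gunzipSync, gzipSync } from "node:zlib";

// Subset of the decoded CloudWatch Logs subscription payload.
interface LogBatch {
  logGroup: string;
  logEvents: { id: string; timestamp: number; message: string }[];
}

// Hypothetical shape for illustration; the project's own AlarmPayload may differ.
interface SyntheticAlarm {
  functionName: string;
  errorMessage: string;
  timestamp: number;
}

// Error patterns listed in the text above.
const ERROR_PATTERN = /ERROR|Exception|Traceback|CRITICAL|Fatal|panic/;

// event.awslogs.data arrives as base64-encoded, gzip-compressed JSON.
export function decodeBatch(data: string): LogBatch {
  const json = gunzipSync(Buffer.from(data, "base64")).toString("utf8");
  return JSON.parse(json) as LogBatch;
}

// Inverse of decodeBatch, useful only for building local test events.
export function encodeBatch(batch: LogBatch): string {
  return gzipSync(Buffer.from(JSON.stringify(batch), "utf8")).toString("base64");
}

// Returns undefined when no event in the batch matches an error pattern.
export function toSyntheticAlarm(batch: LogBatch): SyntheticAlarm | undefined {
  const hit = batch.logEvents.find((e) => ERROR_PATTERN.test(e.message));
  if (!hit) return undefined;
  return {
    functionName: batch.logGroup.replace(/^\/aws\/lambda\//, ""),
    errorMessage: hit.message.slice(0, 500), // truncate to 500 chars
    timestamp: hit.timestamp,
  };
}
```

Only the first matching line is used, so a batch containing several errors still produces a single alarm payload.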
The Analyser Lambda reads from SQS. Before starting the expensive Step Functions workflow, it checks the DynamoDB dedup table using a SHA-256 hash of the error message. If the same error has been seen within the last 30 minutes, a lightweight "duplicate suppressed" notice is posted to Slack instead, and the full pipeline is skipped.
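A minimal sketch of the dedup check, assuming an in-memory `Map` as a stand-in for the DynamoDB table. The class and function names are hypothetical, and whether a suppressed duplicate should extend the window is a design choice the text does not specify; this sketch does not extend it.

```typescript
import { createHash } from "node:crypto";

const DEDUP_WINDOW_MS = 30 * 60 * 1000; // 30 minutes

// The dedup key is a SHA-256 hash of the error message.
export function dedupKey(errorMessage: string): string {
  return createHash("sha256").update(errorMessage, "utf8").digest("hex");
}

// In-memory stand-in for the DynamoDB dedup table, for illustration only.
export class DedupWindow {
  private readonly lastSeen = new Map<string, number>();

  // Returns true if the same message was recorded within the window.
  isDuplicate(errorMessage: string, nowMs: number): boolean {
    const key = dedupKey(errorMessage);
    const previous = this.lastSeen.get(key);
    if (previous !== undefined && nowMs - previous < DEDUP_WINDOW_MS) {
      return true; // suppressed; the original timestamp is kept
    }
    this.lastSeen.set(key, nowMs);
    return false;
  }
}
```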
A single Task Lambda handles all five steps of the Step Functions Express Workflow. Each invocation receives the current pipeline state and a step field telling it what to do.
fetchLogs
Runs a CloudWatch Logs Insights query on the Lambda's log group, fetching up to 50 log lines in a ±60 second window around the error timestamp. Polls until the query completes (up to 30 seconds). If the log group does not exist, returns an empty array rather than failing the pipeline.
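As a rough sketch, the query parameters for this step could be built like this. The field names follow the CloudWatch Logs `StartQuery` input, which takes start and end times as epoch seconds. The query string and the function name `buildLogsQuery` are assumptions, not the project's actual code.

```typescript
// Mirrors the fields of the CloudWatch Logs StartQuery input used here.
interface InsightsQueryParams {
  logGroupName: string;
  startTime: number; // epoch seconds
  endTime: number; // epoch seconds
  queryString: string;
  limit: number;
}

const WINDOW_SECONDS = 60;
const MAX_LOG_LINES = 50;

export function buildLogsQuery(functionName: string, errorTimestampMs: number): InsightsQueryParams {
  const errorSeconds = Math.floor(errorTimestampMs / 1000);
  return {
    logGroupName: `/aws/lambda/${functionName}`,
    startTime: errorSeconds - WINDOW_SECONDS,
    endTime: errorSeconds + WINDOW_SECONDS,
    // Illustrative query; the real query string is not shown in this README.
    queryString: "fields @timestamp, @message | sort @timestamp asc",
    limit: MAX_LOG_LINES,
  };
}
```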
retrieveCode
Queries the Bedrock Knowledge Base using the error message as the search query, enriched with any function names extracted from the stack trace. Returns up to 40 semantically relevant code chunks. Each chunk includes the full file path scoped to the repo (owner/repo/src/file.ts), start line, end line, and text.
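One way the query enrichment could work, sketched for Node.js-style stack frames only. The regular expression and both function names are assumptions; other runtimes format their stack traces differently and would need their own patterns.

```typescript
// Extracts function names from frames such as "at getOrder (/var/task/x.js:1:2)".
export function extractFunctionNames(stack: string): string[] {
  const names = new Set<string>();
  const frame = /^\s*at\s+(?:async\s+)?([\w$.]+)\s+\(/gm;
  let match: RegExpExecArray | null;
  while ((match = frame.exec(stack)) !== null) {
    names.add(match[1]);
  }
  return Array.from(names);
}

// Appends any extracted names to the error message to form the search query.
export function buildSearchQuery(errorMessage: string, stack: string): string {
  const names = extractFunctionNames(stack);
  return names.length > 0 ? `${errorMessage} ${names.join(" ")}` : errorMessage;
}
```

Anonymous frames (those without a function name) are simply skipped, so the query falls back to the bare error message when nothing useful can be extracted.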
rerank
Sends the retrieved chunks to the Bedrock Rerank API (cohere.rerank-v3-5:0 by default, configurable). The reranker scores each chunk for relevance to the error query and returns the top 8. This step filters out broadly-similar-but-not-actually-useful chunks before they reach Claude.
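The selection after the rerank call can be sketched as below. It assumes the reranker returns one `{ index, relevanceScore }` entry per scored document, where `index` points back into the input list. The type and function names are hypothetical.

```typescript
interface CodeChunk {
  filePath: string;
  startLine: number;
  endLine: number;
  text: string;
}

// Assumed result shape: index refers to the position in the input chunk list.
interface RerankResult {
  index: number;
  relevanceScore: number;
}

// Keeps the topN highest-scoring chunks, best first.
export function selectTopChunks(chunks: CodeChunk[], results: RerankResult[], topN = 8): CodeChunk[] {
  return [...results]
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topN)
    .map((r) => chunks[r.index])
    .filter((c): c is CodeChunk => c !== undefined); // ignore out-of-range indexes
}
```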
chooseModel
A Step Functions Choice state (no Lambda invocation). Routes high-severity errors (*CRITICAL*, *ERROR* in the alarm name or state reason) to Claude Opus, and lower-severity ones to Claude Haiku. Model IDs come from lib/config.ts.
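The routing rule is implemented as a Choice state, but its logic is equivalent to the plain function below. This is shown only to make the rule concrete: the function name is hypothetical, and the matching is case-sensitive here, as wildcard string matching in Step Functions is.

```typescript
// Model IDs copied from lib/config.ts as shown later in this README.
const models = {
  highSeverity: "us.anthropic.claude-opus-4-7-v1:0",
  standard: "us.anthropic.claude-haiku-4-5-20251001-v1:0",
} as const;

// Equivalent of matching *CRITICAL* or *ERROR* against either field.
export function chooseModel(alarmName: string, stateReason: string): string {
  const isHighSeverity = [alarmName, stateReason].some(
    (text) => text.includes("CRITICAL") || text.includes("ERROR"),
  );
  return isHighSeverity ? models.highSeverity : models.standard;
}
```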
analyse
Invokes Claude via the Bedrock InvokeModel API. Passes the alarm name, error reason, log lines, and ranked code chunks as structured XML context. Claude responds with a JSON object containing rootCause, trigger, suggestedFix, severity, and filePath. The prompt instructs Claude to return raw JSON only; any markdown fences in the response are stripped before parsing.
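A sketch of the fence stripping and parsing described above. The `Analysis` field names come from the text; the validation step and the function name `parseAnalysis` are assumptions about how a defensive parser might look.

```typescript
interface Analysis {
  rootCause: string;
  trigger: string;
  suggestedFix: string;
  severity: string;
  filePath: string;
}

const REQUIRED_FIELDS = ["rootCause", "trigger", "suggestedFix", "severity", "filePath"] as const;

// Three backticks, built at runtime so this snippet nests safely in markdown.
const FENCE = "`".repeat(3);
const LEADING_FENCE = new RegExp("^" + FENCE + "(?:json)?\\s*", "i");
const TRAILING_FENCE = new RegExp("\\s*" + FENCE + "$");

export function parseAnalysis(raw: string): Analysis {
  // Remove an optional opening fence (with or without "json") and closing fence.
  const unfenced = raw.trim().replace(LEADING_FENCE, "").replace(TRAILING_FENCE, "");
  const parsed = JSON.parse(unfenced) as Record<string, unknown>;
  for (const field of REQUIRED_FIELDS) {
    if (typeof parsed[field] !== "string") {
      throw new Error(`Model response is missing string field: ${field}`);
    }
  }
  return parsed as unknown as Analysis;
}
```

Validating the fields up front means a malformed model response fails in this step rather than producing a half-empty Slack message later.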
notify
Posts the analysis to Slack using an Incoming Webhook. The message uses Block Kit attachments with a colour-coded severity bar (red for CRITICAL, orange for ERROR, yellow for WARNING, green for INFO). Includes root cause, trigger, suggested fix, relevant file, and a deep link to the CloudWatch alarm.
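The message body might be assembled along these lines. The hex colour values, the block layout, and all names here are illustrative assumptions; only the severity-to-colour mapping and the list of included fields come from the description above.

```typescript
type Severity = "CRITICAL" | "ERROR" | "WARNING" | "INFO";

interface AnalysisSummary {
  severity: Severity;
  rootCause: string;
  trigger: string;
  suggestedFix: string;
  filePath: string;
}

// Illustrative hex values for red, orange, yellow and green.
const SEVERITY_COLOURS: Record<Severity, string> = {
  CRITICAL: "#d32f2f",
  ERROR: "#f57c00",
  WARNING: "#fbc02d",
  INFO: "#388e3c",
};

// One Block Kit section with a bold title line.
const section = (title: string, body: string) => ({
  type: "section",
  text: { type: "mrkdwn", text: `*${title}*\n${body}` },
});

export function buildSlackMessage(analysis: AnalysisSummary, alarmUrl: string) {
  return {
    attachments: [
      {
        color: SEVERITY_COLOURS[analysis.severity], // the colour bar
        blocks: [
          section("Root cause", analysis.rootCause),
          section("Trigger", analysis.trigger),
          section("Suggested fix", analysis.suggestedFix),
          section("Relevant file", analysis.filePath),
          section("Alarm", `<${alarmUrl}|Open in CloudWatch>`),
        ],
      },
    ],
  };
}
```

The returned object is what would be serialised as JSON and sent in the POST body to the Incoming Webhook URL.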
Code indexing is the most complex part of the system. The goal is to make your source code searchable by meaning, so that when an error occurs in a Lambda, the analyser can ask "which parts of my codebase are related to this error?" and get back relevant functions and classes based on semantic proximity rather than a plain text match.
When Claude analyses an error, it needs context beyond just the log line. It needs to see the actual code that produced it. The challenge is that this code lives in GitHub, not in AWS. The indexer bridges this gap by continuously syncing your code into a Bedrock Knowledge Base that the analyser can query at runtime.
PUSH TIME                          QUERY TIME
(runs on every git push)           (runs on every Lambda error)

GitHub repo                        Error message
    │                                  │
    │ webhook                          │ semantic search
    ▼                                  ▼
Indexer Lambda                     Bedrock KB query
    │                                  │
    ├─ fetch changed files             ├─ converts error to a vector
    ├─ hash → skip unchanged           ├─ finds nearest code vectors
    ├─ chunk into functions            └─ returns top 40 matching chunks
    ├─ upload to S3                        │
    └─ trigger KB ingestion                ▼
           │                       Reranker → top 8 chunks
           ▼                               │
    Bedrock embeds                         ▼
    each chunk as                  Claude analyses
    a vector and stores            error + logs + code
    in S3 Vectors
GitHub sends a POST to the indexer endpoint with the push payload. Before anything else, the HMAC-SHA256 signature in the X-Hub-Signature-256 header is validated against the webhook secret using a timing-safe byte comparison. Requests with invalid signatures are rejected with 401 immediately.
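A minimal sketch of the signature check using Node's built-in `crypto` module. GitHub sends the header in the form `sha256=<hex digest>`, computed over the raw request body. The function names are hypothetical, and `signBody` is included mainly so the check can be exercised locally.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes the value GitHub would send in X-Hub-Signature-256 for this body.
export function signBody(rawBody: string, secret: string): string {
  return "sha256=" + createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

export function verifySignature(rawBody: string, signatureHeader: string | undefined, secret: string): boolean {
  if (!signatureHeader) return false;
  const expected = Buffer.from(signBody(rawBody, secret), "utf8");
  const received = Buffer.from(signatureHeader, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The signature must be computed over the exact bytes received, so the body should be verified before any JSON parsing or re-serialisation.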
Only push events are processed. ping events (sent when a webhook is first created) return pong. All other event types are acknowledged and ignored.
The push payload contains a list of commits, each with added, modified, and removed file paths. The indexer builds a combined list of files that need re-indexing (added + modified) and files that need removing.
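Building the combined lists could look like the sketch below. It assumes the commits in the payload are ordered oldest first, so that a later commit overrides an earlier one for the same path. The function and type names are hypothetical.

```typescript
// Subset of a commit entry in the GitHub push payload.
interface PushCommit {
  added: string[];
  modified: string[];
  removed: string[];
}

export function collectChanges(commits: PushCommit[]): { toIndex: string[]; toRemove: string[] } {
  const toIndex = new Set<string>();
  const toRemove = new Set<string>();
  for (const commit of commits) {
    // The latest action on a path wins.
    for (const path of [...commit.added, ...commit.modified]) {
      toIndex.add(path);
      toRemove.delete(path);
    }
    for (const path of commit.removed) {
      toRemove.add(path);
      toIndex.delete(path);
    }
  }
  return { toIndex: Array.from(toIndex), toRemove: Array.from(toRemove) };
}
```

Using sets also collapses a file that was modified in several commits of the same push into a single fetch.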
Each file to re-index is fetched individually from the GitHub Contents API using the commit SHA as the ref — this ensures we get exactly the version of the file that was pushed, not the latest HEAD.
GET https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref={sha}
Authorization: Bearer {github-pat}
The file content is returned base64-encoded and decoded in memory. No disk I/O.
Each fetched file's content is SHA-256 hashed. The hash is compared against the last known hash stored in DynamoDB under the key {owner}/{repo}/{filePath}.
DynamoDB key: "SamuelLawrence876/jse-bot/src/trades/handler.ts"
DynamoDB item: { fileName: "...", hash: "a3f9c2...", ttl: 1735000000 }
If the hash matches, the file is skipped. This means a push that only changes one file out of 50 only re-indexes that one file. Hashes expire after 90 days so stale entries are automatically cleaned up.
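The decode, hash and compare steps can be sketched as follows, with a `Map` standing in for the DynamoDB lookup. The item fields match the example above; the function names and the `FetchedFile` type are assumptions.

```typescript
import { createHash } from "node:crypto";

const NINETY_DAYS_SECONDS = 90 * 24 * 60 * 60;

// The Contents API returns base64 (possibly with line breaks, which Buffer ignores).
export function decodeContent(base64Content: string): string {
  return Buffer.from(base64Content, "base64").toString("utf8");
}

export function sha256Hex(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface FetchedFile {
  key: string; // "{owner}/{repo}/{filePath}"
  content: string;
}

interface HashItem {
  fileName: string;
  hash: string;
  ttl: number; // epoch seconds
}

// knownHashes stands in for the hashes previously stored in DynamoDB.
export function diffAgainstKnown(
  files: FetchedFile[],
  knownHashes: Map<string, string>,
  nowSeconds: number,
): { changed: FetchedFile[]; items: HashItem[] } {
  const changed: FetchedFile[] = [];
  const items: HashItem[] = [];
  for (const file of files) {
    const hash = sha256Hex(file.content);
    if (knownHashes.get(file.key) === hash) continue; // unchanged, skip
    changed.push(file);
    items.push({ fileName: file.key, hash, ttl: nowSeconds + NINETY_DAYS_SECONDS });
  }
  return { changed, items };
}
```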
Storing whole files in the Knowledge Base would be inefficient — a 500-line file would be one giant blob, and the search would return the entire file for every query. Instead, each file is split into smaller chunks at meaningful code boundaries.
The chunker scans each line looking for function, class, and method declarations:
TypeScript/JS: export function ..., const x = () => ..., class Foo
Python: def my_function
Go: func myFunction
Java: public void method(
When it finds a boundary (and the current chunk is at least 5 lines), it closes the current chunk and starts a new one. Chunks are capped at 80 lines to prevent oversized blobs.
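A simplified sketch of this line-scanning approach is shown below. The regular expressions are rough approximations of the declaration forms listed above and are assumptions, as are the names; the project's actual patterns may differ. The 5-line minimum and 80-line cap follow the text.

```typescript
interface Chunk {
  startLine: number; // 1-based, inclusive
  endLine: number; // 1-based, inclusive
  text: string;
}

const MIN_LINES = 5;
const MAX_LINES = 80;

// Approximate declaration patterns, for illustration only.
const BOUNDARIES: RegExp[] = [
  /^\s*(export\s+)?(default\s+)?(async\s+)?function\b/, // JS/TS function
  /^\s*(export\s+)?(const|let|var)\s+\w+\s*=\s*(async\s+)?\([^)]*\)\s*=>/, // arrow function
  /^\s*(export\s+)?(default\s+)?(abstract\s+)?class\s+\w+/, // class
  /^\s*(async\s+)?def\s+\w+/, // Python
  /^\s*func\s+[\w(]/, // Go
  /^\s*(public|private|protected)\s+[\w<>\[\], ]+\s+\w+\s*\(/, // Java method
];

export function chunkSource(source: string): Chunk[] {
  const lines = source.split("\n");
  const chunks: Chunk[] = [];
  let start = 0; // zero-based index of the first line of the current chunk

  // Closes the current chunk so that it ends just before endExclusive.
  const close = (endExclusive: number): void => {
    if (endExclusive <= start) return;
    chunks.push({
      startLine: start + 1,
      endLine: endExclusive,
      text: lines.slice(start, endExclusive).join("\n"),
    });
    start = endExclusive;
  };

  lines.forEach((line, i) => {
    const size = i - start; // lines already in the current chunk
    const atBoundary = BOUNDARIES.some((re) => re.test(line));
    if (size >= MAX_LINES || (atBoundary && size >= MIN_LINES)) close(i);
  });
  close(lines.length);
  return chunks;
}
```

Because of the 5-line minimum, a short run of imports at the top of a file stays attached to the first declaration instead of becoming its own tiny chunk.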
Example — this file:
import { DynamoDBClient } from '@aws-sdk/client-dynamodb'; // lines 1–3
const client = new DynamoDBClient({});
export const getOrder = async (id: string) => { // lines 4–12
const result = await client.send(...);
if (!result.Item) throw new Error(`Order ${id} not found`);
return result.Item;
// ... more code ...
};
export const createOrder = async (order: Order) => { // lines 13–25
await client.send(...);
return order;
// ... more code ...
};
Would produce two chunks:
- Chunk 1: lines 1–12, text = the getOrder function and its imports
- Chunk 2: lines 13–25, text = the createOrder function
Supported file types: .ts, .js, .tsx, .jsx, .py, .go, .java. Other file types (JSON, YAML, markdown, etc.) are skipped.
Each file's chunks are serialised as a JSON array and uploaded to S3 at a path scoped by repo and file:
s3://cloud-error-code-chunks/chunks/SamuelLawrence876/jse-bot/src/trades/handler.ts.json
The JSON contains the chunk text plus metadata that Bedrock will attach to each vector:
[
{
"id": "SamuelLawrence876/jse-bot/src/trades/handler.ts:4",
"text": "export const getOrder = async (id: string) => { ... }",
"metadata": {
"repo": "SamuelLawrence876/jse-bot",
"filePath": "SamuelLawrence876/jse-bot/src/trades/handler.ts",
"startLine": 4,
"endLine": 12
}
}
]
For removed files, the corresponding S3 object is deleted so the Knowledge Base doesn't return stale chunks for code that no longer exists.
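The object key and JSON records shown above could be produced by helpers like these. The function names are hypothetical; the key layout and record fields mirror the examples in this section.

```typescript
interface Chunk {
  startLine: number;
  endLine: number;
  text: string;
}

interface ChunkRecord {
  id: string;
  text: string;
  metadata: { repo: string; filePath: string; startLine: number; endLine: number };
}

// repo is "owner/name"; filePath is relative to the repo root.
export function chunkObjectKey(repo: string, filePath: string): string {
  return `chunks/${repo}/${filePath}.json`;
}

export function toChunkRecords(repo: string, filePath: string, chunks: Chunk[]): ChunkRecord[] {
  const scopedPath = `${repo}/${filePath}`;
  return chunks.map((chunk) => ({
    id: `${scopedPath}:${chunk.startLine}`,
    text: chunk.text,
    metadata: {
      repo,
      filePath: scopedPath,
      startLine: chunk.startLine,
      endLine: chunk.endLine,
    },
  }));
}
```

The same `chunkObjectKey` value serves for both upload and delete, which keeps removed files and re-indexed files addressed consistently.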
After S3 is updated, a Bedrock ingestion job is started. Bedrock scans the S3 bucket for new or changed objects, calls Amazon Titan Text Embeddings V2 on each chunk's text to produce a high-dimensional vector, and stores the vectors in the S3 Vectors backing store.
This is the "learning" step — Bedrock is converting human-readable code into a mathematical representation that can be searched by semantic similarity.
"export const getOrder = async (id: string) => { ... }"
│
│ Titan Text Embeddings V2
▼
[0.023, -0.187, 0.441, 0.009, ... ] (1024 dimensions by default)
│
▼
stored in S3 Vectors
The ingestion job runs asynchronously — the Lambda returns a 200 immediately after starting it and does not wait for it to finish (it typically takes 1–3 minutes for a small batch).
Once S3 and KB ingestion are kicked off, the new file hashes are written back to DynamoDB. This updates the baseline for the next push — if these exact files are pushed again with no changes, they'll be skipped.
All your repos share the same S3 bucket and the same Knowledge Base. Each chunk's filePath metadata is prefixed with the repo name (SamuelLawrence876/jse-bot/src/...). When the analyser queries the KB for a jse-bot Lambda error, the chunks it gets back usually come from the jse-bot repo because those tend to be the semantically closest vectors to the error message, not because there is any explicit filtering.
The practical consequence: you don't need to tell the system which repo an error came from. The vector search returns the most relevant code across all indexed repos, although near-identical code in another repo can occasionally rank alongside it.
The Webhook Registrar Lambda runs daily on an EventBridge schedule. It lists all repositories owned by the configured GitHub user, checks each for the indexer webhook URL, and creates the webhook if it is missing. This means new GitHub repositories are automatically wired up within 24 hours of being created, with no manual steps.
The Subscription Registrar Lambda runs daily on an EventBridge schedule. It lists all /aws/lambda/* log groups in the account and ensures each one has the cloud-error-error-filter subscription filter pointing at the Log Forwarder Lambda. Any cloud-error-* log groups are excluded to prevent the system from feeding its own errors back into itself.
It can also be run on demand: invoke it manually after deploying a new Lambda to opt it into error detection immediately, without waiting for the next daily run.
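The selection logic for this job can be sketched as a pure function over the listed log groups. The `LogGroupInfo` shape and function name are hypothetical. The filter pattern constant uses the CloudWatch Logs `?term` syntax (match any of the terms) and is an assumption about how the error pattern might be expressed.

```typescript
const FILTER_NAME = "cloud-error-error-filter";
const LAMBDA_PREFIX = "/aws/lambda/";
const OWN_PREFIX = "/aws/lambda/cloud-error-";

// Assumed filter pattern: "?term" means "match if any of these terms appear".
export const ERROR_FILTER_PATTERN = "?ERROR ?Exception ?Traceback ?CRITICAL ?Fatal ?panic";

// Hypothetical shape: a log group name plus the names of its existing filters.
interface LogGroupInfo {
  name: string;
  filterNames: string[];
}

// Returns the log groups that still need the subscription filter attached.
export function groupsNeedingFilter(groups: LogGroupInfo[]): string[] {
  return groups
    .filter((g) => g.name.startsWith(LAMBDA_PREFIX))
    .filter((g) => !g.name.startsWith(OWN_PREFIX)) // avoid feedback loops
    .filter((g) => !g.filterNames.includes(FILTER_NAME))
    .map((g) => g.name);
}
```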
All infrastructure is defined with AWS CDK in lib/. There are no manual CloudFormation steps.
| Resource | Type | Purpose |
|---|---|---|
| cloud-error-analyser-ingress | SQS Queue | Buffers alarm payloads before analysis |
| cloud-error-analyser-dlq | SQS Queue | Dead letters after 3 failed analysis attempts |
| cloud-error-analyser | Lambda | Reads from SQS, deduplicates, starts Step Functions |
| cloud-error-analysis-workflow | Step Functions (Express) | Orchestrates the 5-step analysis pipeline |
| cloud-error-analyser-task | Lambda | Executes each step of the workflow |
| cloud-error-dedup | DynamoDB Table | Tracks seen errors for 30-minute dedup window |
| cloud-error-indexer | Lambda | Handles GitHub push webhooks |
| cloud-error-indexer | API Gateway HTTP API | Exposes the indexer as a webhook endpoint |
| cloud-error-code-chunks | S3 Bucket | Stores chunked source code as JSON |
| cloud-error-file-hashes | DynamoDB Table | Stores per-file SHA-256 hashes for change detection |
| cloud-error-webhook-registrar | Lambda | Daily job to register webhooks on new GitHub repos |
| cloud-error-log-forwarder | Lambda | Receives CW Logs batches, enqueues error payloads |
| cloud-error-subscription-registrar | Lambda | Daily job to attach subscription filters to log groups |
| cloud-error-analyser | CloudWatch Dashboard | Monitors queue depth, execution success/failure |
lib/config.ts is the only file that contains identity-specific values:
export const config = {
domain: {
hostedZone: 'your-domain.com', // your Route 53 hosted zone
subdomain: 'cloud-error.your-domain.com', // subdomain for the webhook endpoint
},
ssm: {
slackWebhookUrl: '/cloud-error/slack-webhook-url',
bedrockKbId: '/cloud-error/bedrock-kb-id',
bedrockKbDsId: '/cloud-error/bedrock-kb-ds-id',
githubPat: '/cloud-error/github-pat',
githubWebhookSecret: '/cloud-error/github-webhook-secret',
},
bedrock: {
highSeverityModel: 'us.anthropic.claude-opus-4-7-v1:0',
standardModel: 'us.anthropic.claude-haiku-4-5-20251001-v1:0',
rerankModelId: 'cohere.rerank-v3-5:0',
},
} as const;
If you don't have a custom domain, remove IndexerDomain from lib/cloudErrorStack.ts and replace the webhookUrl prop with indexerApi.api.apiEndpoint + '/index'.
Region note for rerankModelId: cohere.rerank-v3-5:0 works in all supported regions including us-east-1. If you are deploying to us-west-2, ca-central-1, eu-central-1, or ap-northeast-1, you can use amazon.rerank-v1:0 for a fully AWS-native setup with no external model dependency.
# Slack — create an Incoming Webhook at api.slack.com/apps
aws ssm put-parameter --name /cloud-error/slack-webhook-url \
--value "https://hooks.slack.com/services/..." --type String
# GitHub — Personal Access Token with `repo` scope
aws ssm put-parameter --name /cloud-error/github-pat \
--value "ghp_..." --type String
# GitHub — any random string used to sign webhook payloads
aws ssm put-parameter --name /cloud-error/github-webhook-secret \
--value "$(openssl rand -hex 32)" --type String
# Bedrock — fill these after the first CDK deploy (see step 4)
aws ssm put-parameter --name /cloud-error/bedrock-kb-id --value "PLACEHOLDER" --type String
aws ssm put-parameter --name /cloud-error/bedrock-kb-ds-id --value "PLACEHOLDER" --type String
cdk bootstrap aws://YOUR_ACCOUNT_ID/YOUR_REGION
npm run deploy
The first deploy creates the cloud-error-code-chunks S3 bucket, which you need before creating the Knowledge Base.
AWS CDK cannot fully automate Knowledge Base creation. In the AWS Console:
- Go to Amazon Bedrock → Knowledge Bases → Create
- Set the S3 data source to the cloud-error-code-chunks bucket
- Choose Amazon Titan Text Embeddings V2 as the embedding model
- Choose S3 Vectors as the vector store — this is pay-per-use with no fixed minimum cost
- Note the Knowledge Base ID and Data Source ID
Update SSM with the real IDs:
aws ssm put-parameter --name /cloud-error/bedrock-kb-id --value "YOUR_KB_ID" --type String --overwrite
aws ssm put-parameter --name /cloud-error/bedrock-kb-ds-id --value "YOUR_DS_ID" --type String --overwrite
npm run deploy
# Wire up all existing Lambda log groups immediately
aws lambda invoke --function-name cloud-error-subscription-registrar /dev/null
# Register webhooks on all existing GitHub repos immediately
aws lambda invoke --function-name cloud-error-webhook-registrar /dev/null
After this, the system is fully live.
In the AWS Console, go to Amazon Bedrock → Model Access and request access to:
- Anthropic Claude Opus 4.7
- Anthropic Claude Haiku 4.5
The model IDs in lib/config.ts use cross-region inference profiles (the us. prefix). These require the underlying models to be enabled in your account.
The GitHub Actions pipeline in .github/workflows/pipeline.yml runs on every push and pull request to master.
| Job | Trigger | Steps |
|---|---|---|
| check | All branches | Typecheck, unit tests, CDK synth |
| deploy | master push only | Deploy to AWS via OIDC |
| acceptance | After deploy | End-to-end test against live AWS resources |
| Name | Type | Value |
|---|---|---|
| AWS_ACCOUNT_ID | Variable | Your AWS account ID |
| AWS_REGION | Variable | e.g. us-east-1 |
| AWS_DEPLOY_ROLE_ARN | Secret | ARN of the IAM role the pipeline assumes |
The deploy role requires permissions to deploy CDK stacks (CloudFormation, Lambda, SQS, DynamoDB, S3, Bedrock, IAM, etc.). The pipeline uses GitHub OIDC — no long-lived AWS credentials are stored.
npm test # unit tests (all src/**/*.test.ts)
npm run acceptance-test # end-to-end tests against live AWS (requires AWS credentials)
Unit tests mock all AWS SDK clients using aws-sdk-client-mock. No AWS credentials are needed to run them.
The acceptance test (acceptance-test/index.test.ts) submits a synthetic alarm payload directly to the SQS queue and polls Step Functions until the execution completes, then verifies a Slack notification was sent.
npm run typecheck # TypeScript type checking without emitting
npm run synth # synthesise CDK CloudFormation templates (no deploy)
npm run deploy # deploy all stacks to AWS
lib/
config.ts ← single source of truth for all config
cloudErrorStack.ts ← CDK stack definition
constructs/
analyser/ ← SQS queue, Lambda, Step Functions, DynamoDB
indexer/ ← Lambda, API Gateway, S3, DynamoDB
log-forwarder/ ← Lambda
subscription-registrar/ ← Lambda + EventBridge schedule
webhook-registrar/ ← Lambda + EventBridge schedule
observability/ ← CloudWatch dashboard
src/
logger.ts ← structured JSON logger (shared)
types.ts ← shared TypeScript interfaces
analyser/
handler.ts ← SQS consumer, dedup, Step Functions trigger
taskHandler.ts ← Step Functions task executor (all 5 steps)
analysis/llm.ts ← Claude invocation + response parsing
deduplication/deduplicator.ts← DynamoDB-backed error dedup
logs/logFetcher.ts ← CloudWatch Logs Insights queries
logs/logGroupResolver.ts ← maps alarm metrics to log group names
notification/slack.ts ← Slack Block Kit message builder + poster
retrieval/retriever.ts ← semantic search via Bedrock KB
retrieval/reranker.ts ← Bedrock Rerank API
indexer/
handler.ts ← GitHub webhook handler
chunking/chunker.ts ← code splitting into chunks
github/githubClient.ts ← GitHub API + webhook signature verification
hashing/merkle.ts ← SHA-256 file hashing + DynamoDB persistence
vectors/vectorStore.ts ← S3 upload + Bedrock KB ingestion + query
log-forwarder/handler.ts ← CloudWatch Logs subscription filter consumer
subscription-registrar/handler.ts ← attaches subscription filters to log groups
webhook-registrar/
handler.ts ← registers GitHub webhooks on repos
githubRepos.ts ← GitHub API client for repo/webhook management
acceptance-test/
index.test.ts ← end-to-end pipeline test
Based on light usage: a solo developer, ~5 repos, ~10 error analyses per day (300/month), ~50 code pushes per month.
The vector store backing the Bedrock Knowledge Base is S3 Vectors — a pay-per-use model with no fixed minimum cost. This keeps the bill very low.
| Service | Cost/month | Notes |
|---|---|---|
| Bedrock — Claude inference | ~$3–5 | Opus 4.7 for high-severity alarms only; Haiku 4.5 ($1/$5 per M input/output tokens) for the bulk of standard alarms, across ~300 analyses |
| Bedrock — Rerank | ~$1 | Cohere Rerank 3.5 via Bedrock, one rerank query of up to 40 chunks per analysis |
| S3 Vectors (KB vector store) | ~$0.01 | ~$2.50 per million API requests + $0.06/GB storage — negligible at this scale |
| Bedrock — Titan Embeddings | ~$0.02 | Embeddings on code pushes only |
| CloudWatch Logs | ~$0.50 | Log ingestion + Insights queries |
| Route 53 | ~$0.50 | Hosted zone (likely already paying this) |
| SES | ~$0.03 | ~300 notification emails/month |
| Step Functions Express | ~$0.01 | ~3,000 state transitions/month |
| Lambda | ~$0 | Comfortably within the 1M request / 400K GB-second free tier |
| SQS | ~$0 | Within the 1M request free tier |
| DynamoDB | ~$0 | Within the 25 GB / 25 RCU/WCU free tier |
| S3 | ~$0.05 | Code chunk JSON objects |
| API Gateway | ~$0.01 | Minimal webhook invocations |
| EventBridge | ~$0 | Scheduled rules are free |
| Total | ~$5–7/month | |
The dominant costs are Bedrock inference (Claude) and reranking. Everything else is negligible. Costs scale linearly with the number of error analyses — roughly $0.015–0.02 per analysis.