Skip to content

[awsfirehose] Fix VPC flow log detection in firehose routing pipeline#18395

Open
zmoog wants to merge 11 commits intomainfrom
fix/firehose-vpcflow-operator-precedence
Open

[awsfirehose] Fix VPC flow log detection in firehose routing pipeline#18395
zmoog wants to merge 11 commits intomainfrom
fix/firehose-vpcflow-operator-precedence

Conversation

@zmoog
Copy link
Copy Markdown
Contributor

@zmoog zmoog commented Apr 14, 2026

Summary

  • Fixed broken VPC flow log detection condition in the firehose ingest pipeline that had two bugs: operator precedence (&& vs ||) and incorrect string matching ("eni- with leading quote never matched inside JSON-wrapped messages)
  • Replaced the fragile condition with robust detection covering all VPC flow log variants (v2-v10):
    • tgw- + transitgateway — transit gateway flow logs (default format includes resource-type: TransitGateway)
    • eni- — standard VPC flow logs, requires a secondary marker (accept, reject, nodata, skipdata)
    • nat- — v9 regional NAT gateway flow logs (where interface-id is -), requires a secondary marker
  • Expanded VPC flow log test suite from 1 case to 16, covering all versions (v2-v10) and all field formats supported by the downstream vpcflow pipeline (14, 17, 18, 21, 29, 30, 36, 39, 40 tokens), using examples from the official AWS VPC Flow Logs documentation
  • Added a regression test to the firewall test suite: a firewall log with "action":"reject" that must route to aws.firewall_logs, not aws.vpcflow — this is the exact scenario the original operator precedence bug caused
  • Added a "How log routing works" section to the docs explaining the detection method and format requirements for each log type
  • Bumped package version to 1.9.1

Known limitation

VPC flow logs using custom formats that exclude both the action and log-status fields (e.g., the 6-field zonal NAT gateway format) cannot be detected through content matching alone. The default format always includes action and log-status, so this only affects deliberately minimal custom configurations. Users with such formats should use the es_datastream_name delivery stream parameter for routing instead.

Test plan

  • elastic-package test pipeline -v -d logs passes all 22 tests (16 vpcflow + 2 firewall + 8 other log types unchanged)
  • No regressions in WAF, CloudTrail, Firewall, Route53, API Gateway, S3, CloudFront, or ELB detection
  • Token counts verified against downstream aws.vpcflow pipeline dissect patterns

Closes #18110

🤖 Generated with Claude Code

@zmoog zmoog requested a review from a team as a code owner April 14, 2026 20:47
@zmoog zmoog self-assigned this Apr 14, 2026
@zmoog zmoog changed the title Fix VPC flow log detection in firehose routing pipeline [awsfirehose] Fix VPC flow log detection in firehose routing pipeline Apr 14, 2026
@zmoog zmoog force-pushed the fix/firehose-vpcflow-operator-precedence branch 2 times, most recently from 573b050 to 9848b2b Compare April 14, 2026 21:33
@elastic-vault-github-plugin-prod
Copy link
Copy Markdown

🚀 Benchmarks report

To see the full report comment with /test benchmark fullreport

zmoog and others added 5 commits April 15, 2026 01:12
The condition for detecting VPC flow logs had two bugs:

1. Operator precedence: missing parentheses around the `accept || reject`
   check caused any message containing "reject" to be misrouted to
   `aws.vpcflow` regardless of other conditions.

2. Broken string matching: the checks for `"eni-` and `"vpc-` (with
   leading double-quote) never matched because inside the JSON wrapper
   `{"message":"..."}`, ENI and VPC IDs are preceded by spaces, not
   quotes. The old test only passed as a side effect of bug #1.

The new condition checks for `eni-` (without quote prefix) as the primary
marker, combined with secondary markers that cover all VPC flow log
variants:
- `accept`/`reject` for standard flow logs with traffic data
- `nodata`/`skipdata` for records with no captured data
- `tgw-` for transit gateway flow logs (which lack ACCEPT/REJECT)

The test suite is expanded from 1 to 12 cases covering all formats
supported by the downstream vpcflow pipeline (14, 17, 21, 29, 36, 39,
40 fields), using examples from the official AWS documentation.

Closes #18110

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Regional NAT gateway VPC flow logs (v9) can have interface-id = "-"
with only nat-xxxxx as a resource identifier. Adding nat- as an
alternative primary marker ensures these records are correctly
routed to aws.vpcflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds test fixtures for the two missing VPC flow log versions:
- v4 (18 tokens): v2 fields + region, az-id, sublocation-type, sublocation-id
- v10 (30 tokens): v5 fields + encryption-status

All versions v2-v10 now have explicit test coverage (15 cases total).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Transit gateway flow logs can have both tgw-src-eni and tgw-dst-eni
set to "-", leaving no eni- in the record. Since tgw- is specific
enough on its own (CloudTrail is matched before this check), promote
it to a standalone primary path:

  tgw- || ((eni- || nat-) && (accept || reject || nodata || skipdata))

Adds test case #16: transit gateway with OK status and no ENI fields,
detected solely via tgw-.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…flow

Adds a firewall log event whose alert action is "reject" to the
firewall test suite. This is the exact scenario the original bug
caused: the old condition `... && contains('accept') ||
contains('reject')` would misroute any message containing "reject"
to aws.vpcflow due to operator precedence.

Since the VPC flow check runs before the firewall check in the
pipeline, this test will fail if the precedence bug is reintroduced
— the event would be routed to aws.vpcflow instead of the expected
aws.firewall_logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zmoog zmoog force-pushed the fix/firehose-vpcflow-operator-precedence branch from 9848b2b to 25fef9e Compare April 14, 2026 23:12
@andrewkroh andrewkroh added Integration:awsfirehose Amazon Data Firehose Team:obs-ds-hosted-services Observability Hosted Services team [elastic/obs-ds-hosted-services] labels Apr 15, 2026
Copy link
Copy Markdown
Contributor

@tetianakravchenko tetianakravchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please add changelog record and bump version of the package


// AWS VPC Flow Logs
else if (message_lower.contains('"eni-') && message_lower.contains('"vpc-') && message_lower.contains('accept') || message_lower.contains('reject')) {
// AWS VPC Flow Logs (standard, transit gateway, and regional NAT gateway)
Copy link
Copy Markdown
Contributor

@tetianakravchenko tetianakravchenko Apr 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zonal NAT gateway is also covered, isn't it?
can be added some clarification comment, like:

Suggested change
// AWS VPC Flow Logs (standard, transit gateway, and regional NAT gateway)
// AWS VPC Flow Logs:
// eni- → standard ENI-based records AND zonal NAT gateway (interface-id = eni-xxxxx)
// nat- → regional NAT gateway v9+ (interface-id = "-", resource-id = nat-xxxxx)
// tgw- → transit gateway (no action field; log-status may be OK with no ACCEPT/REJECT)

?
(if it is correct)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems so. Thanks for checking, taking a closer look at this use case.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the eni- part, we can only support the standard format (which includes accept/reject). We probably need to note this in the docs, if it's not there yet.

And yes, we should update the comment to reflect what we support now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment.

// AWS VPC Flow Logs
else if (message_lower.contains('"eni-') && message_lower.contains('"vpc-') && message_lower.contains('accept') || message_lower.contains('reject')) {
// AWS VPC Flow Logs (standard, transit gateway, and regional NAT gateway)
else if (message_lower.contains('tgw-') || ((message_lower.contains('eni-') || message_lower.contains('nat-')) && (message_lower.contains('accept') || message_lower.contains('reject') || message_lower.contains('nodata') || message_lower.contains('skipdata')))) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wondering if tgw- as a standalone condition has a risk of matching non-vpc logs (that follow after this check in ingest pipeline)?
in https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-records-examples.html#flow-log-example-tgw is mentioned that action should be available

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, checking this as well!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tightened the transit gateway detection rule adding the resource-type.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, where transit gateway log samples are coming from? I see that test samples 1-8 are from https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs-records-examples.html, but the rest? are these real or reconstructed from spec?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used examples from the docs as much as possible, but some test cases have been reconstructed from the spec.

Let me see if I can find more examples, including outside the AWS docs. We can also set up a couple of VPCs with a transit gateway to collect some samples.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets start with what we have now and iterate over it 👍

zmoog and others added 4 commits April 16, 2026 11:52
Using tgw- as a standalone primary marker risks matching non-VPC
logs that happen to reference transit gateway resources (e.g., a
firewall log with "tgw-0abc" in a signature field).

Tighten the transit gateway path to require both tgw- and
transitgateway (the resource-type field value in the default TGW
flow log format). This eliminates false positives while still
covering all default-format transit gateway flow logs.

Known limitation: VPC flow logs using custom formats that exclude
both the action and log-status fields (e.g., the 6-field zonal NAT
gateway format from the AWS docs) cannot be detected through content
matching alone. The default format always includes action and
log-status, so this only affects deliberately minimal custom
configurations. Users with such formats should use stream-name-based
routing instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds a "How log routing works" section to the docs explaining how
each log type is detected and what format requirements users must
follow for auto-detection to work. Includes a note about using the
es_datastream_name parameter as a fallback for custom formats.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Apr 16, 2026

Vale Linting Results

Summary: 4 warnings found

⚠️ Warnings (4)
File Line Rule Message
packages/awsfirehose/_dev/build/docs/README.md 32 Elastic.Latinisms Latin terms and abbreviations are a common source of confusion. Use 'using' instead of 'via'.
packages/awsfirehose/_dev/build/docs/README.md 36 Elastic.Latinisms Latin terms and abbreviations are a common source of confusion. Use 'using' instead of 'via'.
packages/awsfirehose/docs/README.md 32 Elastic.Latinisms Latin terms and abbreviations are a common source of confusion. Use 'using' instead of 'via'.
packages/awsfirehose/docs/README.md 36 Elastic.Latinisms Latin terms and abbreviations are a common source of confusion. Use 'using' instead of 'via'.

The Vale linter checks documentation changes against the Elastic Docs style guide.

To use Vale locally or report issues, refer to Elastic style guide for Vale.

@zmoog
Copy link
Copy Markdown
Contributor Author

zmoog commented Apr 16, 2026

I added a new section in the docs to suggest how to successfully route logs using this integration.

@andrewkroh andrewkroh added the documentation Improvements or additions to documentation. Applied to PRs that modify *.md files. label Apr 16, 2026
Copy link
Copy Markdown
Contributor

@tetianakravchenko tetianakravchenko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zmoog build is failing:

Error: checking package failed: checking readme files are up-to-date failed: files do not match

I believe elastic-package build should fix it

// AWS WAF Logs

// Initialize lower case version of the message to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence seems to be not finished, or?

Comment thread packages/awsfirehose/docs/README.md Outdated
each record to the appropriate integration dataset. The detection relies on known patterns in the
log message, such as field names, resource identifiers, or structural characteristics.

The following table describes how each log type is detected and what is required for routing
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this table can get quickly outdated, not sure if we should ducplicate in readme filtering logic
I think mentioning about the detecting, maybe 1 specific example and leaving a note about auto-detection not working should be good for the doc

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, after a second look I thought rh same: I'll only leave the recommendations on the formats.

Also rebuilds the docs and completes truncated comments.
@elasticmachine
Copy link
Copy Markdown

💚 Build Succeeded

History

cc @zmoog

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation. Applied to PRs that modify *.md files. Integration:awsfirehose Amazon Data Firehose Team:obs-ds-hosted-services Observability Hosted Services team [elastic/obs-ds-hosted-services]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bad if condition - firehose pipeline -> vpc flow log

4 participants