feat: implement structured extraction checkpoints B3, B4 and B5 by Abhijeet2409 · Pull Request #921 · OWASP/OpenCRE

Abhijeet2409 · 2026-06-09T05:39:51Z

Summary

Implements Workstream B Checkpoints B3, B4, and B5 from the OWASP Cheat Sheet to CRE Mapping pipeline RFC.

This PR is built on top of the existing Workstream B1/B2 extraction implementation (PR #912).

Changes

`cheatsheet_extractor.py` (Checkpoint B3)

Added deterministic fallback handling for malformed or irregular markdown inputs.

Implemented:

fallback title extraction
fallback summary extraction
consistent fallback metadata handling through fallback_used

Updated the regex patterns to handle malformed markdown headings, such as `##Heading` (no space after the marker) and headings with leading whitespace before the `##` marker.
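The kind of permissive matching described above can be sketched as follows. This is a minimal illustration, not the PR's actual patterns: `TITLE_RE`, `HEADING_RE`, `find_title`, and `find_headings` are hypothetical names, and the patterns simply assume that leading spaces or tabs are tolerated and that the space after the marker is optional.

```python
import re

# Hypothetical patterns (assumptions, not copied from the PR):
# - [ \t]* before the marker tolerates indented headings
# - (?!#) keeps "#" from matching "##" and "##" from matching "###"
# - [ \t]* after the marker makes the separating space optional
TITLE_RE = re.compile(r"^[ \t]*#(?!#)[ \t]*(?P<title>\S.*)$", re.MULTILINE)
HEADING_RE = re.compile(r"^[ \t]*##(?!#)[ \t]*(?P<heading>\S.*)$", re.MULTILINE)


def find_title(markdown):
    """Return the first H1 text, or None when no H1 is present."""
    match = TITLE_RE.search(markdown)
    return match.group("title").strip() if match else None


def find_headings(markdown):
    """Return the text of every H2 heading in document order."""
    return [m.group("heading").strip() for m in HEADING_RE.finditer(markdown)]


sample = "   # My Title\n##Authentication\n   ## My Heading\n### Deeper\n"
print(find_title(sample))     # My Title
print(find_headings(sample))  # ['Authentication', 'My Heading']
```

A sketch like this does not skip fenced code blocks, so a `#` comment inside a fence would also match; whether the real extractor handles that case is not stated in this PR description.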

No fallback handling was required for:

source_id
hyperlink
raw_markdown_path

since these are deterministically derived from the source path provided by Workstream A.

Added

1. `cheatsheet_extractor_test.py` (Checkpoint B4)

Added dedicated extractor tests covering normal, malformed, fallback, and empty-markdown scenarios.

Test coverage includes:

Normal — standard OWASP cheatsheet, ensures no fallback behavior is triggered
Missing H1 — verifies title fallback behavior
Body under H1 — verifies fallback summary extraction
Empty markdown — verifies fallback title, fallback summary, and empty headings handling
Malformed headings — verifies malformed markdown headings are extracted correctly using updated regex handling

Additional validation coverage includes:

bounded summary validation through SUMMARY_MAX_LENGTH
fallback metadata validation through fallback_used
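The scenarios listed above could be exercised with tests shaped roughly like the sketch below. It is self-contained for illustration only: `extract_stub` is a hypothetical stand-in for the real extractor, and the value of `SUMMARY_MAX_LENGTH` is assumed because this description names the constant without giving its value.

```python
import unittest

# Assumed value; the PR names the constant but this excerpt does not state it.
SUMMARY_MAX_LENGTH = 500


def extract_stub(markdown):
    """Hypothetical stand-in for the extractor, used only to make the tests runnable."""
    lines = [ln.strip() for ln in markdown.splitlines() if ln.strip()]
    title_lines = [ln for ln in lines if ln.startswith("# ")]
    body = " ".join(ln for ln in lines if not ln.startswith("#"))
    # Any missing piece means a default was substituted.
    fallback_used = not title_lines or not body
    title = title_lines[0][2:].strip() if title_lines else "Untitled"
    summary = (body or "No summary available.")[:SUMMARY_MAX_LENGTH]
    return {
        "title": title,
        "summary": summary,
        "metadata": {"fallback_used": fallback_used},
    }


class ExtractorContractTest(unittest.TestCase):
    def test_normal_input_does_not_use_fallback(self):
        record = extract_stub("# Title\n\nSome introduction text.\n")
        self.assertEqual(record["title"], "Title")
        self.assertFalse(record["metadata"]["fallback_used"])

    def test_empty_markdown_uses_fallback(self):
        record = extract_stub("")
        self.assertTrue(record["metadata"]["fallback_used"])
        self.assertTrue(record["title"])
        self.assertTrue(record["summary"])

    def test_summary_is_bounded(self):
        record = extract_stub("# Title\n\n" + "x" * (SUMMARY_MAX_LENGTH * 2))
        self.assertLessEqual(len(record["summary"]), SUMMARY_MAX_LENGTH)
```

Saved as a module, tests of this shape run with `python -m unittest`. Asserting on `metadata["fallback_used"]` in every case keeps the fallback signal itself under test, not just the field values.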

Test Results

All tests passed.

2. `rfc-structured-extraction.md` (Checkpoint B5)

This document serves as a reference for:

users trying to understand Workstream B extraction behavior
RFC contributors working on downstream workstreams
future maintainers debugging parser and fallback behavior

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

coderabbitai · 2026-06-09T05:39:59Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Cheatsheet extraction now tracks fallback usage per-call via a computed flag instead of a module constant. Title and summary extractors raise ValueError on missing content, with new fallback helpers providing defaults. The main orchestrator wraps extraction in try/except, logs errors, applies fallback helpers on failure, and records fallback usage in metadata. Comprehensive tests cover multiple markdown variants, and an RFC document specifies the extraction contract.
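The control flow in this walkthrough can be sketched as below. The helper names `_extract_title` and `_fallback_title` appear in the change summary, but their signatures and bodies here are assumptions, and `extract_record_sketch` is a hypothetical simplification of the orchestrator that handles the title only.

```python
import logging
import re
from pathlib import PurePosixPath

logger = logging.getLogger(__name__)

# Assumed pattern; the PR's actual regex is not shown in this thread.
_TITLE_RE = re.compile(r"^[ \t]*#(?!#)[ \t]*(?P<title>\S.*)$", re.MULTILINE)


def _extract_title(markdown):
    """Return the H1 text, raising ValueError when none is found."""
    match = _TITLE_RE.search(markdown)
    if match is None:
        raise ValueError("no H1 title found")
    return match.group("title").strip()


def _fallback_title(source_path):
    """Assumed default: derive a readable title from the file name."""
    stem = PurePosixPath(source_path).stem
    return stem.replace("_", " ").strip() or "Untitled"


def extract_record_sketch(markdown, source_path):
    """Hypothetical orchestrator covering the title field only."""
    fallback_used = False  # per-call flag, so state cannot leak between calls
    try:
        title = _extract_title(markdown)
    except ValueError as exc:
        logger.warning("title extraction failed for %s: %s", source_path, exc)
        title = _fallback_title(source_path)
        fallback_used = True
    return {"title": title, "metadata": {"fallback_used": fallback_used}}


ok = extract_record_sketch("# Real Title\n\nBody", "cheatsheets/X.md")
bad = extract_record_sketch("no heading", "cheatsheets/Input_Validation_Cheat_Sheet.md")
print(ok["title"], ok["metadata"])    # Real Title {'fallback_used': False}
print(bad["title"], bad["metadata"])  # Input Validation Cheat Sheet {'fallback_used': True}
```

Computing the flag inside the function, rather than reading a module constant, means a failed extraction on one document cannot mark the next document's record as a fallback.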

Changes

Fallback handling and extraction robustness

| Layer / File(s) | Summary |
| --- | --- |
| Parser regex, title/summary extraction, and fallbacks `application/utils/external_project_parsers/parsers/cheatsheet_extractor.py` | Update heading/title regexes for permissive whitespace/marker matching; `_extract_title` and `_extract_summary` now raise `ValueError` when missing; add `_fallback_title` and `_fallback_summary` helpers. |
| Main orchestration: try/except and metadata wiring `application/utils/external_project_parsers/parsers/cheatsheet_extractor.py` | `extract_cheatsheet_record` uses a per-call `fallback_used` flag, wraps extraction in `try/except ValueError`, logs failures, applies fallback helpers, and writes `metadata["fallback_used"]`. |
| Unit tests for extraction variants `application/tests/cheatsheet_extractor_test.py` | New unittest module exercising normal, missing H1, empty input, body-under-H1, and malformed-heading scenarios; asserts `title`, `summary`, `headings`, and `metadata["fallback_used"]` for each fixture. |
| RFC structured extraction specification `docs/rfc-structured-extraction.md` | Document CheatsheetRecord contract, fallback behavior decision tree, and three extraction examples; specify deterministic field derivations and malformed heading handling. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

OWASP/OpenCRE#912: Further refinements to cheatsheet_extractor.py's title/"Introduction" summary extraction, fallback logic, and fallback_used handling.

Suggested reviewers

Pa04rth

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ⚠️ Warning | The PR title refers to implementing checkpoints B3, B4, and B5, but the PR objectives and code changes only demonstrate implementation of checkpoints B3 and B4 (with no evidence of B5 work in the provided changes). | Update the PR title to remove the reference to B5, or ensure that B5 implementation is actually included in the changeset and documented in the PR objectives. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description clearly relates to the changeset, describing the implementation of Workstream B Checkpoints B3, B4, and B5 with specific details about fallback handling, regex updates, test coverage, and documentation. |

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

application/utils/external_project_parsers/parsers/cheatsheet_extractor.py (1)

12-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty ## headings are currently dropped despite the “can be empty” contract.

Line 99 states empty headings are allowed, but `_HEADING_RE` requires at least one character (`.+`), so a bare `##` or a `##` followed only by whitespace is excluded instead of being preserved as an empty string.

Suggested fix

-_HEADING_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)
+_HEADING_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)

-    headings = [m.group("heading").strip() for m in _HEADING_RE.finditer(markdown)]
+    headings = [(m.group("heading") or "").strip() for m in _HEADING_RE.finditer(markdown)]

Also applies to: 99-100
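Before adopting a pattern like the one suggested above, it may be worth checking how it behaves on a bare `##` line. In Python's `re`, `\s` also matches newlines, so `\s+` can step onto the following line and capture that line as the heading text. The sketch below compares the suggested pattern with a hypothetical line-bounded variant that uses `[ \t]+` instead; `LINE_BOUNDED_RE` and `headings` are illustrative names, not code from the PR.

```python
import re

# Pattern as suggested in the review comment above.
SUGGESTED_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)

# Hypothetical variant (not from the PR): [ \t] cannot cross a newline,
# so an empty heading stays confined to its own line.
LINE_BOUNDED_RE = re.compile(r"^##(?:[ \t]+(?P<heading>.*))?$", re.MULTILINE)


def headings(pattern, markdown):
    """Collect heading text, mapping an unmatched optional group to ""."""
    return [(m.group("heading") or "").strip() for m in pattern.finditer(markdown)]


# Two empty headings: a bare "##" and a "## " with trailing whitespace.
sample = "## Intro\n##\nBody text\n## \n## Next\n"

print(headings(SUGGESTED_RE, sample))     # ['Intro', 'Body text', '## Next']
print(headings(LINE_BOUNDED_RE, sample))  # ['Intro', '', '', 'Next']
```

With the line-bounded variant both empty headings are preserved as `""`, which matches the "can be empty" contract quoted in the comment. Whether the extractor's other patterns need the same treatment depends on code not shown in this thread.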

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/external_project_parsers/parsers/cheatsheet_extractor.py`
around lines 12 - 13, The `_HEADING_RE` and `_ANY_HEADING_RE` currently require
at least one character of heading text (using `.+`), which drops empty `##` headings;
change their patterns to allow empty heading text (replace `.+` with `.*`) while
keeping the `(?P<heading>)` group name intact so empty headings capture as an
empty/whitespace string, and ensure downstream code that uses `heading` (e.g.,
any trimming or truthiness checks) preserves empty headings per the comment at
line 99 (trim whitespace but treat the result as "" rather than dropping the
heading).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 05aa9804-df31-406d-af28-37a40b53f9af

📥 Commits

Reviewing files that changed from the base of the PR and between b637225 and a76be61.

📒 Files selected for processing (1)

application/utils/external_project_parsers/parsers/cheatsheet_extractor.py

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/rfc-structured-extraction.md`:
- Around line 251-255: The inline code examples in the malformed-heading section
(the examples with backticks around `   # My Title`, `   ## My Heading`, and
`##Authentication`) violate markdownlint MD038 because of the leading spaces
inside the backticks. Fix this by removing the leading spaces from within the
backticks themselves (so backticks contain only the actual code without padding)
or move these examples into a fenced code block instead of using inline code
formatting to better display the problematic patterns while maintaining linting
compliance.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: edf43053-98b1-492e-8b9c-4cf86693987b

📥 Commits

Reviewing files that changed from the base of the PR and between a5cd582 and 8ae3c36.

📒 Files selected for processing (2)

application/tests/cheatsheet_extractor_test.py
docs/rfc-structured-extraction.md

🚧 Files skipped from review as they are similar to previous changes (1)

application/tests/cheatsheet_extractor_test.py

Abhijeet2409 added 5 commits June 9, 2026 11:00

feat: implement structured extraction checkpoints B1 and B2

ade7c18

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

docs: add docstrings

a9e54a3

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

fix: validate normalized string field values correctly

26d7e92

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

fix: validate normalized string field values correctly

dc9d2d0

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

feat: implement structured extraction checkpoint B3

a76be61

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

Abhijeet2409 mentioned this pull request Jun 9, 2026

feat: implement structured extraction checkpoint B3 Abhijeet2409/OpenCRE#1

Closed

Abhijeet2409 marked this pull request as ready for review June 9, 2026 05:44

coderabbitai Bot reviewed Jun 9, 2026

Merge branch 'main' into feature/structured-extraction-b3

2ba1f4b

Abhijeet2409 changed the title ~~feat: implement structured extraction checkpoint B3~~ feat: implement structured extraction checkpoint B3 & B4 Jun 11, 2026

feat: add B4 tests for cheatsheet extractor

a5cd582

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

Abhijeet2409 force-pushed the feature/structured-extraction-b3 branch from a03b674 to a5cd582 Compare June 11, 2026 08:41

Abhijeet2409 added 3 commits June 11, 2026 15:32

Merge branch 'main' into feature/structured-extraction-b3

c5edfea

Merge branch 'main' into feature/structured-extraction-b3

48b9b8a

Merge branch 'main' into feature/structured-extraction-b3

e944958

coderabbitai Bot reviewed Jun 14, 2026

Comment thread docs/rfc-structured-extraction.md Outdated

docs: add checkpoint B5 documentation and refine test comments

27a7f44

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

Abhijeet2409 force-pushed the feature/structured-extraction-b3 branch from 8ae3c36 to 27a7f44 Compare June 14, 2026 09:03

Abhijeet2409 changed the title ~~feat: implement structured extraction checkpoint B3 & B4~~ feat: implement structured extraction checkpoint B3 , B4 & B5 Jun 14, 2026

Abhijeet2409 changed the title ~~feat: implement structured extraction checkpoint B3 , B4 & B5~~ feat: implement structured extraction checkpoints B3, B4 and B5 Jun 14, 2026

docs: refine malformed heading behavior notes

8a3ac8f

Signed-off-by: Abhijeet <abhijeetsaharan2236@gmail.com>

Abhijeet2409 wants to merge 12 commits into OWASP:main from Abhijeet2409:feature/structured-extraction-b3
