feat: implement structured extraction checkpoints B3, B4 and B5#921

Open
Abhijeet2409 wants to merge 12 commits into OWASP:main from Abhijeet2409:feature/structured-extraction-b3

@Abhijeet2409 Abhijeet2409 commented Jun 9, 2026

Contributor

Summary

Implements Workstream B Checkpoints B3, B4, and B5 from the OWASP Cheat Sheet to CRE Mapping pipeline RFC.

This PR is built on top of the existing Workstream B1/B2 extraction implementation (PR #912).

Changes

cheatsheet_extractor.py (Checkpoint B3)

Added deterministic fallback handling for malformed or irregular markdown inputs.

Implemented:

  • fallback title extraction
  • fallback summary extraction
  • consistent fallback metadata handling through fallback_used

Updated the regex patterns to handle malformed markdown headings, such as a heading with no space after the marker (##Heading) or a heading with leading whitespace (for example, "   ## Heading").
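
A minimal sketch of what such permissive patterns could look like. The names `TITLE_RE`, `HEADING_RE`, and `find_headings` are hypothetical and are not taken from the PR diff, which is not shown on this page; the real patterns in cheatsheet_extractor.py may differ.

```python
import re

# Hypothetical patterns, not the ones from the PR: allow optional leading
# spaces/tabs, then the marker, then optional spaces/tabs before the text.
# (?!#) keeps "#" from matching "##..." and "##" from matching "###...".
TITLE_RE = re.compile(r"^[ \t]*#(?!#)[ \t]*(?P<title>\S.*)$", re.MULTILINE)
HEADING_RE = re.compile(r"^[ \t]*##(?!#)[ \t]*(?P<heading>\S.*)$", re.MULTILINE)


def find_headings(markdown: str) -> list[str]:
    """Return the text of every level-two heading, stripped of padding."""
    return [m.group("heading").strip() for m in HEADING_RE.finditer(markdown)]


sample = "   # My Title\n##Authentication\n   ## My Heading\n### Not level two\n"
print(TITLE_RE.search(sample).group("title"))  # My Title
print(find_headings(sample))                   # ['Authentication', 'My Heading']
```

Using `[ \t]` rather than `\s` keeps each match on a single line, since `\s` also matches newlines.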

No fallback handling was required for:

  • source_id
  • hyperlink
  • raw_markdown_path

since these are deterministically derived from the source path provided by Workstream A.


Added

1. cheatsheet_extractor_test.py (Checkpoint B4)

Added dedicated extractor tests covering normal, malformed, fallback, and empty-markdown scenarios.

Test coverage includes:

  • Normal — standard OWASP cheatsheet, ensures no fallback behavior is triggered
  • Missing H1 — verifies title fallback behavior
  • Body under H1 — verifies fallback summary extraction
  • Empty markdown — verifies fallback title, fallback summary, and empty headings handling
  • Malformed headings — verifies malformed markdown headings are extracted correctly using updated regex handling

Additional validation coverage includes:

  • bounded summary validation through SUMMARY_MAX_LENGTH
  • fallback metadata validation through fallback_used
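
The bounded-summary check above could be exercised along these lines. This is a sketch under stated assumptions: `bound_summary` is a hypothetical stand-in for the extractor's summary logic, and the value 300 for `SUMMARY_MAX_LENGTH` is assumed, since the real constant's value does not appear on this page.

```python
import unittest

SUMMARY_MAX_LENGTH = 300  # assumed value; the real constant is not shown here


def bound_summary(text: str) -> str:
    """Hypothetical helper: collapse whitespace, then cap the length."""
    collapsed = " ".join(text.split())
    return collapsed[:SUMMARY_MAX_LENGTH]


class BoundedSummaryTest(unittest.TestCase):
    def test_long_summary_is_truncated(self):
        summary = bound_summary("word " * 1000)
        self.assertLessEqual(len(summary), SUMMARY_MAX_LENGTH)

    def test_short_summary_is_unchanged(self):
        self.assertEqual(bound_summary("Short intro."), "Short intro.")


# Run the suite explicitly so the snippet does not call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(BoundedSummaryTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```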

Test Results

All tests passed correctly.

2. rfc-structured-extraction.md (Checkpoint B5)

This document serves as a reference for:

  • users trying to understand Workstream B extraction behavior
  • RFC contributors working on downstream workstreams
  • future maintainers debugging parser and fallback behavior

Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>

coderabbitai Bot commented Jun 9, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Cheatsheet extraction now tracks fallback usage per-call via a computed flag instead of a module constant. Title and summary extractors raise ValueError on missing content, with new fallback helpers providing defaults. The main orchestrator wraps extraction in try/except, logs errors, applies fallback helpers on failure, and records fallback usage in metadata. Comprehensive tests cover multiple markdown variants, and an RFC document specifies the extraction contract.
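
The flow described in this walkthrough can be sketched roughly as follows. Only the names `_extract_title`, `_extract_summary`, `_fallback_title`, `_fallback_summary`, `extract_cheatsheet_record`, `fallback_used`, and `SUMMARY_MAX_LENGTH` come from the PR; every function body, signature, default value, and the returned dict shape below are assumptions made for illustration, not the actual implementation.

```python
import logging
import re

logger = logging.getLogger(__name__)

SUMMARY_MAX_LENGTH = 300  # assumed value
_TITLE_RE = re.compile(r"^[ \t]*#(?!#)[ \t]*(?P<title>\S.*)$", re.MULTILINE)


def _extract_title(markdown: str) -> str:
    match = _TITLE_RE.search(markdown)
    if match is None:
        raise ValueError("no H1 title found")
    return match.group("title").strip()


def _extract_summary(markdown: str) -> str:
    # Simplified assumption: the first non-empty, non-heading line.
    for line in markdown.splitlines():
        text = line.strip()
        if text and not text.startswith("#"):
            return text[:SUMMARY_MAX_LENGTH]
    raise ValueError("no summary text found")


def _fallback_title(source_path: str) -> str:
    # Assumed default: derive a title from the file name.
    stem = source_path.rsplit("/", 1)[-1].removesuffix(".md")
    return stem.replace("_", " ")


def _fallback_summary() -> str:
    return ""  # assumed default


def extract_cheatsheet_record(markdown: str, source_path: str) -> dict:
    fallback_used = False  # computed per call, not a module-level constant
    try:
        title = _extract_title(markdown)
    except ValueError as exc:
        logger.warning("title fallback for %s: %s", source_path, exc)
        title = _fallback_title(source_path)
        fallback_used = True
    try:
        summary = _extract_summary(markdown)
    except ValueError as exc:
        logger.warning("summary fallback for %s: %s", source_path, exc)
        summary = _fallback_summary()
        fallback_used = True
    return {
        "title": title,
        "summary": summary,
        "metadata": {"fallback_used": fallback_used},
    }
```

Keeping the flag local to each call means one malformed input cannot leak fallback state into the record produced for the next file.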

Changes

Fallback handling and extraction robustness

| Layer | File(s) | Summary |
| --- | --- | --- |
| Parser regex, title/summary extraction, and fallbacks | application/utils/external_project_parsers/parsers/cheatsheet_extractor.py | Update heading/title regexes for permissive whitespace/marker matching; _extract_title and _extract_summary now raise ValueError when missing; add _fallback_title and _fallback_summary helpers. |
| Main orchestration: try/except and metadata wiring | application/utils/external_project_parsers/parsers/cheatsheet_extractor.py | extract_cheatsheet_record uses a per-call fallback_used flag, wraps extraction in try/except ValueError, logs failures, applies fallback helpers, and writes metadata["fallback_used"]. |
| Unit tests for extraction variants | application/tests/cheatsheet_extractor_test.py | New unittest module exercising normal, missing H1, empty input, body-under-H1, and malformed-heading scenarios; asserts title, summary, headings, and metadata["fallback_used"] for each fixture. |
| RFC structured extraction specification | docs/rfc-structured-extraction.md | Document CheatsheetRecord contract, fallback behavior decision tree, and three extraction examples; specify deterministic field derivations and malformed heading handling. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • OWASP/OpenCRE#912: Further refinements to cheatsheet_extractor.py's title/"Introduction" summary extraction, fallback logic, and fallback_used handling.

Suggested reviewers

  • Pa04rth
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ⚠️ Warning | The PR title refers to implementing checkpoints B3, B4, and B5, but the PR objectives and code changes only demonstrate implementation of checkpoints B3 and B4 (with no evidence of B5 work in the provided changes). | Update the PR title to remove the reference to B5, or ensure that B5 implementation is actually included in the changeset and documented in the PR objectives. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description clearly relates to the changeset, describing the implementation of Workstream B Checkpoints B3, B4, and B5 with specific details about fallback handling, regex updates, test coverage, and documentation. |

@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
application/utils/external_project_parsers/parsers/cheatsheet_extractor.py (1)

12-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Empty ## headings are currently dropped despite the “can be empty” contract.

Line 99 states empty headings are allowed, but _HEADING_RE requires at least one character (.+), so a bare ## or a ## followed only by whitespace is excluded instead of being preserved as an empty string.

Suggested fix
-_HEADING_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)
+_HEADING_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)
-    headings = [m.group("heading").strip() for m in _HEADING_RE.finditer(markdown)]
+    headings = [(m.group("heading") or "").strip() for m in _HEADING_RE.finditer(markdown)]

Also applies to: 99-100
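
One point worth verifying before adopting a pattern of this shape: in Python's `re`, `\s` also matches newlines, so `\s+` after a bare `##` can run onto the following line and capture that line as the heading text. The sketch below compares the suggested pattern with a hypothetical line-bound alternative (`LINE_BOUND_RE` and the `headings` helper are illustrative names, not part of the PR).

```python
import re

# Pattern as written in the suggested fix above.
SUGGESTED_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)

# Hypothetical alternative: [ \t]* cannot cross a line break, and (?!#)
# keeps level-three headings out. It also accepts "##Heading" (no space).
LINE_BOUND_RE = re.compile(r"^##(?!#)[ \t]*(?P<heading>.*)$", re.MULTILINE)


def headings(pattern: re.Pattern, markdown: str) -> list[str]:
    """Collect heading text, mapping a missing capture to an empty string."""
    return [(m.group("heading") or "").strip() for m in pattern.finditer(markdown)]


sample = "##\nBody text\n## Real heading\n"
print(headings(SUGGESTED_RE, sample))   # ['Body text', 'Real heading']
print(headings(LINE_BOUND_RE, sample))  # ['', 'Real heading']
```

With the line-bound variant the empty heading is preserved as an empty string and the body line is left alone.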

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@application/utils/external_project_parsers/parsers/cheatsheet_extractor.py`
around lines 12 - 13, The `_HEADING_RE` and `_ANY_HEADING_RE` currently require
at least one non-space character (using `.+`) which drops empty `##` headings;
change their patterns to allow empty heading text (replace `.+` with `.*`) while
keeping the `(?P<heading>)` group name intact so empty headings capture as an
empty/whitespace string, and ensure downstream code that uses `heading` (e.g.,
any trimming or truthiness checks) preserves empty headings per the comment at
line 99 (trim whitespace but treat the result as "" rather than dropping the
heading).
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 05aa9804-df31-406d-af28-37a40b53f9af

📥 Commits

Reviewing files that changed from the base of the PR and between b637225 and a76be61.

📒 Files selected for processing (1)
  • application/utils/external_project_parsers/parsers/cheatsheet_extractor.py

@Abhijeet2409 Abhijeet2409 changed the title from "feat: implement structured extraction checkpoint B3" to "feat: implement structured extraction checkpoint B3 & B4" on Jun 11, 2026
Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
@Abhijeet2409 Abhijeet2409 force-pushed the feature/structured-extraction-b3 branch from a03b674 to a5cd582 on June 11, 2026 08:41

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/rfc-structured-extraction.md`:
- Around line 251-255: The inline code examples in the malformed-heading section
(the examples with backticks around `   # My Title`, `   ## My Heading`, and
`##Authentication`) violate markdownlint MD038 because of the leading spaces
inside the backticks. Fix this by removing the leading spaces from within the
backticks themselves (so backticks contain only the actual code without padding)
or move these examples into a fenced code block instead of using inline code
formatting to better display the problematic patterns while maintaining linting
compliance.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: edf43053-98b1-492e-8b9c-4cf86693987b

📥 Commits

Reviewing files that changed from the base of the PR and between a5cd582 and 8ae3c36.

📒 Files selected for processing (2)
  • application/tests/cheatsheet_extractor_test.py
  • docs/rfc-structured-extraction.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • application/tests/cheatsheet_extractor_test.py

Comment thread docs/rfc-structured-extraction.md Outdated
Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
@Abhijeet2409 Abhijeet2409 force-pushed the feature/structured-extraction-b3 branch from 8ae3c36 to 27a7f44 on June 14, 2026 09:03
@Abhijeet2409 Abhijeet2409 changed the title from "feat: implement structured extraction checkpoint B3 & B4" to "feat: implement structured extraction checkpoint B3 , B4 & B5" on Jun 14, 2026
@Abhijeet2409 Abhijeet2409 changed the title from "feat: implement structured extraction checkpoint B3 , B4 & B5" to "feat: implement structured extraction checkpoints B3, B4 and B5" on Jun 14, 2026
Signed-off-by: Abhijeet <abhijeetsaharan2236@gmail.com>