feat: implement structured extraction checkpoints B3, B4 and B5 #921
Abhijeet2409 wants to merge 12 commits
Signed-off-by: Abhijeet Saharan <abhijeetsaharan2236@gmail.com>
Walkthrough
Cheatsheet extraction now tracks fallback usage per-call via a computed flag instead of a module constant. Title and summary extractors raise
Changes
Fallback handling and extraction robustness
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ 3 passed | ❌ 2 failed (2 warnings)
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
application/utils/external_project_parsers/parsers/cheatsheet_extractor.py (1)
12-13: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Empty `##` headings are currently dropped despite the “can be empty” contract.
Line 99 states empty headings are allowed, but `_HEADING_RE` requires at least one character (`.+`), so bare `##` headings are excluded instead of being preserved as empty strings.
Suggested fix
-_HEADING_RE = re.compile(r"^##\s+(?P<heading>.+)$", re.MULTILINE)
+_HEADING_RE = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)
-    headings = [m.group("heading").strip() for m in _HEADING_RE.finditer(markdown)]
+    headings = [(m.group("heading") or "").strip() for m in _HEADING_RE.finditer(markdown)]
Also applies to: 99-100
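One caveat worth checking before applying a pattern like this: `\s` also matches newlines, so under `re.MULTILINE` a bare `##` line followed by body text can capture the next line as its heading. The sketch below is only an illustration of that behaviour and of one possible alternative. `HEADING_RE` and `extract_headings` are hypothetical names, not the project's actual code, and the alternative assumes only level-2 headings are wanted.

```python
import re

# The pattern from the suggested fix, reproduced to show the newline behaviour.
suggested = re.compile(r"^##(?:\s+(?P<heading>.*))?$", re.MULTILINE)
print(suggested.search("##\nbody text").group("heading"))  # body text

# Hypothetical alternative (an assumption, not the project's _HEADING_RE):
# [ \t]* keeps the separator on the heading's own line, and (?!#) rejects
# level-3 and deeper headings.
HEADING_RE = re.compile(r"^##(?!#)[ \t]*(?P<heading>.*)$", re.MULTILINE)


def extract_headings(markdown: str) -> list[str]:
    """Return level-2 heading texts, keeping empty headings as ""."""
    return [m.group("heading").strip() for m in HEADING_RE.finditer(markdown)]


sample = "## Intro\n##\nbody text\n### Sub\n##Auth\n"
print(extract_headings(sample))  # ['Intro', '', 'Auth']
```

Because the group always participates in the match here, the `or ""` guard is not needed in this variant.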
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@application/utils/external_project_parsers/parsers/cheatsheet_extractor.py`:
- Around line 12-13: The `_HEADING_RE` and `_ANY_HEADING_RE` currently require
at least one non-space character (using `.+`) which drops empty `##` headings;
change their patterns to allow empty heading text (replace `.+` with `.*`) while
keeping the `(?P<heading>)` group name intact so empty headings capture as an
empty/whitespace string, and ensure downstream code that uses `heading` (e.g.,
any trimming or truthiness checks) preserves empty headings per the comment at
line 99 (trim whitespace but treat the result as "" rather than dropping the
heading).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 05aa9804-df31-406d-af28-37a40b53f9af
📒 Files selected for processing (1)
application/utils/external_project_parsers/parsers/cheatsheet_extractor.py
a03b674 to a5cd582
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/rfc-structured-extraction.md`:
- Around line 251-255: The inline code examples in the malformed-heading section
(the examples with backticks around ` # My Title`, ` ## My Heading`, and
`##Authentication`) violate markdownlint MD038 because of the leading spaces
inside the backticks. Fix this by removing the leading spaces from within the
backticks themselves (so backticks contain only the actual code without padding)
or move these examples into a fenced code block instead of using inline code
formatting to better display the problematic patterns while maintaining linting
compliance.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: edf43053-98b1-492e-8b9c-4cf86693987b
📒 Files selected for processing (2)
application/tests/cheatsheet_extractor_test.py
docs/rfc-structured-extraction.md
🚧 Files skipped from review as they are similar to previous changes (1)
- application/tests/cheatsheet_extractor_test.py
8ae3c36 to 27a7f44
Signed-off-by: Abhijeet <abhijeetsaharan2236@gmail.com>
Summary
Implements Workstream B Checkpoints B3, B4, and B5 from the OWASP Cheat Sheet to CRE Mapping pipeline RFC.
This PR is built on top of the existing Workstream B1/B2 extraction implementation (PR #912).
Changes
cheatsheet_extractor.py (Checkpoint B3)
Added deterministic fallback handling for malformed or irregular markdown inputs.
Implemented:
- fallback_used
Updated regex patterns to correctly handle malformed markdown heading formats such as `##Heading` or headings with leading whitespace like ` ## Heading`.
No fallback handling was required for:
- source_id
- hyperlink
- raw_markdown_path
since these are deterministically derived from the source path provided by Workstream A.
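As a rough illustration of what per-call fallback tracking can look like, here is a minimal sketch. The `Extraction` dataclass, the `extract` function, the title regex, and the choice of the file name as the fallback title are all assumptions made for this example and are not taken from `cheatsheet_extractor.py`.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical title pattern: tolerates leading spaces and a missing space
# after '#', e.g. " # Title" or "#Title"; (?!#) skips '##' and deeper.
_TITLE_RE = re.compile(r"^[ \t]*#(?!#)[ \t]*(?P<title>\S.*)$", re.MULTILINE)


@dataclass(frozen=True)
class Extraction:
    title: str
    fallback_used: bool  # computed on every call, not a module-level constant


def extract(markdown: str, raw_markdown_path: str) -> Extraction:
    match = _TITLE_RE.search(markdown)
    if match:
        return Extraction(title=match.group("title").strip(), fallback_used=False)
    # Assumed fallback for this sketch: derive the title from the file name.
    stem = PurePosixPath(raw_markdown_path).stem
    return Extraction(title=stem.replace("_", " "), fallback_used=True)


print(extract(" # My Title\n\n## Section\n", "cheatsheets/Foo.md"))
print(extract("## Only sections\n", "cheatsheets/Input_Validation_Cheat_Sheet.md"))
```

Returning the flag alongside the extracted value keeps each result self-describing, so callers can tell which records relied on a fallback without consulting shared state.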
Added
1. cheatsheet_extractor_test.py (Checkpoint B4)
Added dedicated extractor tests covering normal, malformed, fallback, and empty-markdown scenarios.
Test coverage includes:
Additional validation coverage includes:
- SUMMARY_MAX_LENGTH
- fallback_used
Test Results
All tests passed.
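For readers unfamiliar with the shape of such tests, the following is a small self-contained sketch in `unittest` style. The stand-in `extract_summary` function, its empty-string fallback, and the value of `SUMMARY_MAX_LENGTH` are assumptions for illustration only; the real tests in `cheatsheet_extractor_test.py` exercise the actual extractor.

```python
import unittest

SUMMARY_MAX_LENGTH = 200  # assumed value; the real constant lives in the extractor


def extract_summary(markdown: str) -> tuple[str, bool]:
    """Stand-in extractor returning (summary, fallback_used)."""
    for line in markdown.splitlines():
        text = line.strip()
        if text and not text.startswith("#"):
            # First non-heading line, capped at the maximum length.
            return text[:SUMMARY_MAX_LENGTH], False
    return "", True  # assumed fallback: empty summary


class SummaryExtractionTest(unittest.TestCase):
    def test_normal_markdown(self):
        summary, fallback_used = extract_summary("# Title\n\nIntro text.\n")
        self.assertEqual(summary, "Intro text.")
        self.assertFalse(fallback_used)

    def test_empty_markdown_uses_fallback(self):
        summary, fallback_used = extract_summary("")
        self.assertEqual(summary, "")
        self.assertTrue(fallback_used)

    def test_summary_is_capped(self):
        summary, _ = extract_summary("x" * (SUMMARY_MAX_LENGTH + 50))
        self.assertEqual(len(summary), SUMMARY_MAX_LENGTH)
```

Saved as a module, a file like this can be run with `python -m unittest`.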
2. rfc-structured-extraction.md (Checkpoint B5)
This document serves as a reference for: