feat(workstream-c): add cheatsheet categorizer and grouping #934

Draft
shreeshtripurwarcomp23-coder wants to merge 2 commits into OWASP:main from shreeshtripurwarcomp23-coder:workstream-c-clean

@shreeshtripurwarcomp23-coder

  • Implement categorize_cheatsheet() with 29-label controlled taxonomy
  • Implement group_cheatsheets() with stable sha256-based group IDs
  • Deterministic keyword/rule baseline, no LLM dependency
  • LLM-optional path with safe fallback on failure
  • Validate all CheatsheetRecord fields in __post_init__
  • 50 tests covering all acceptance criteria from RFC Issue C

CheatsheetRecord uses local stub pending Workstream B merge.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 3362e735-cf57-4a82-8a68-efed34dd28da

📥 Commits

Reviewing files that changed from the base of the PR and between f91b995 and f8472d4.

📒 Files selected for processing (2)
  • application/tests/test_cheatsheet_categorizer.py
  • application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py
  • application/tests/test_cheatsheet_categorizer.py

Summary by CodeRabbit

Release Notes

  • New Features

    • Added an automated cheatsheet categorization system that assigns categories using keyword rules, with optional AI-powered labeling.
    • Added grouping for categorized cheatsheets, producing deterministic groupings for consistent results.
  • Tests

    • Added a comprehensive test suite covering taxonomy rules, deterministic categorization, uncategorized fallback behavior, optional AI labeling flows (including safe fallbacks), and stable group ID generation.

Walkthrough

Adds a new cheatsheet_categorizer.py module defining a controlled TAXONOMY list, CheatsheetRecord and CheatsheetGroup dataclasses, deterministic keyword-based categorization, optional LLM-based categorization with fallback, and a grouping function. A comprehensive unittest module covering taxonomy integrity, all categorization paths, grouping semantics, and internal helpers is added alongside.
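
The LLM-first flow with a deterministic fallback described in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the four-label `TAXONOMY`, the `KEYWORD_RULES` table, and the function names and signatures here are assumptions made for the example (the PR defines 29 labels and its own helpers).

```python
from typing import Callable, Dict, List, Optional

UNCATEGORIZED = "uncategorized"
# Hypothetical subset of the taxonomy; the PR defines 29 labels.
TAXONOMY: List[str] = ["authentication", "cryptography", "injection", UNCATEGORIZED]

# Hypothetical keyword rules: label -> substrings that trigger it.
KEYWORD_RULES: Dict[str, List[str]] = {
    "authentication": ["login", "password", "mfa"],
    "cryptography": ["encryption", "hash", "tls"],
    "injection": ["sql injection", "xss"],
}


def deterministic_categorize(text: str) -> List[str]:
    """Return sorted, deduplicated labels whose keywords occur in the text."""
    lowered = text.lower()
    matched = {
        label
        for label, keywords in KEYWORD_RULES.items()
        if any(keyword in lowered for keyword in keywords)
    }
    # No keyword matched: fall back to the sentinel label.
    return sorted(matched) or [UNCATEGORIZED]


def validate_labels(labels) -> List[str]:
    """Keep only taxonomy labels, dropping duplicates while preserving order."""
    kept: List[str] = []
    for label in labels:
        if isinstance(label, str) and label in TAXONOMY and label not in kept:
            kept.append(label)
    return kept


def categorize(
    text: str,
    use_llm: bool = False,
    llm_labeler: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Try the optional LLM labeler first; fall back to keyword rules."""
    if use_llm and llm_labeler is not None:
        try:
            labels = validate_labels(llm_labeler(text))
        except Exception:
            # Any labeler failure is treated as "no usable labels".
            labels = []
        if labels:
            return labels
    return deterministic_categorize(text)
```

In this sketch an exception, an empty result, or a result containing only labels outside the taxonomy all lead to the same deterministic path, so callers get a taxonomy-valid list either way.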

Changes

Cheatsheet Categorization System

| Layer | File(s) | Summary |
|---|---|---|
| Taxonomy constants and data contracts | `application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py`, `application/tests/test_cheatsheet_categorizer.py` | `TAXONOMY`, `UNCATEGORIZED`, `CheatsheetRecord` with `__post_init__` field validation, and `CheatsheetGroup` with sha256-based `make_group_id` are defined; tests assert taxonomy integrity (uniqueness, lowercase, minimum size) and `make_group_id` determinism/formatting (order-independence, 12-char lowercase hex). |
| Deterministic + LLM categorization and helpers | `application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py`, `application/tests/test_cheatsheet_categorizer.py` | `_build_searchable_text`, `_deterministic_categorize` (sorted/deduped keyword matcher with `UNCATEGORIZED` fallback), `_validate_labels` (taxonomy filter + dedup), and `categorize_cheatsheet` (LLM-first with exception/invalid/empty fallback to deterministic) are implemented; tests cover all code paths including the `use_llm=False` guard and spot-check deterministic mappings. |
| Grouping logic and tests | `application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py`, `application/tests/test_cheatsheet_categorizer.py` | `group_cheatsheets` buckets records by label-set `group_id` and returns groups sorted by `group_id`; tests verify co-grouping by identical label sets, separation for different labels, uncategorized placement, full record coverage, sorted output, and empty-input behavior. |
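
The order-independent, 12-character group ID and the bucketing step summarized above might look like the sketch below. The `"|"` separator, the deduplication of labels before hashing, and the `(title, labels)` tuple shape are assumptions for illustration; the PR's `make_group_id` and `group_cheatsheets` operate on its own dataclasses and may canonicalize differently.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def make_group_id(labels: List[str]) -> str:
    """Hash a canonical form of the label set into 12 lowercase hex chars."""
    # Sorting (and, as an assumption here, deduplicating) makes the ID
    # independent of the order in which labels were produced.
    canonical = "|".join(sorted(set(labels)))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def group_by_labels(
    records: List[Tuple[str, List[str]]],
) -> List[Tuple[str, List[str]]]:
    """Bucket (title, labels) pairs by group ID, sorted by group ID."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for title, labels in records:
        buckets[make_group_id(labels)].append(title)
    # Keys are unique, so sorting the items orders groups by group ID.
    return sorted(buckets.items())
```

Because the ID depends only on the label set, two runs over the same inputs produce the same groups in the same order, which is what makes the output stable across re-imports.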

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 43.28%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding cheatsheet categorizer and grouping functionality with clear feature attribution to workstream-c. |
| Description check | ✅ Passed | The description provides detailed context about the implementation, including key features like the 29-label taxonomy, LLM-optional path, validation mechanisms, and test coverage. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@application/tests/test_cheatsheet_categorizer.py`:
- Around line 42-48: The _make_record helper function creates CheatsheetRecord
instances with an empty summary string, but CheatsheetRecord.__post_init__
enforces that summary must be non-empty, causing construction to fail before
test assertions run. Replace the empty string assignment `summary=""` in the
_make_record function with a minimal non-empty placeholder value (such as a
single space, period, or descriptive placeholder text like "Test summary") to
satisfy the validation requirement.

In `@application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py`:
- Around line 194-210: The __post_init__ method in CheatsheetRecord currently
validates only string-typed fields but does not validate the element types of
list-typed fields (headings and category_hints). This causes runtime crashes
when non-string items reach code expecting to call " ".join() on these fields.
Add validation in __post_init__ to ensure that headings and category_hints are
lists containing only string elements, raising a ValueError with a clear message
if any element is not a string, so that parser input validation fails fast at
construction time rather than later during string joining operations.
- Around line 366-381: The `_validate_labels` function currently allows
`uncategorized` to be returned alongside other valid labels, which violates the
sentinel semantics where `uncategorized` should only be returned when it is the
sole valid label. Modify the function to add logic after building the deduped
list: if `uncategorized` (or the appropriate constant reference from TAXONOMY)
is present in the result AND there are other valid labels alongside it, remove
the `uncategorized` entry from the returned list. This ensures that
`uncategorized` is only returned when no other categories match, preserving its
role as a fallback indicator and preventing inconsistent downstream grouping and
UX.
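
A minimal sketch of the two `cheatsheet_categorizer.py` findings above, assuming a simplified record and a three-label taxonomy. `RecordSketch`, its `title` field, and the taxonomy subset are hypothetical stand-ins; only `summary`, `headings`, `category_hints`, and the `uncategorized` sentinel come from the review text.

```python
from dataclasses import dataclass, field
from typing import List

UNCATEGORIZED = "uncategorized"
# Hypothetical subset of the taxonomy, for illustration only.
TAXONOMY = ["authentication", "cryptography", UNCATEGORIZED]


@dataclass
class RecordSketch:
    """Simplified stand-in for CheatsheetRecord (title is a hypothetical field)."""

    title: str
    summary: str
    headings: List[str] = field(default_factory=list)
    category_hints: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Fail fast at construction if a list field holds non-string items,
        # instead of crashing later in " ".join(...).
        for name in ("headings", "category_hints"):
            value = getattr(self, name)
            if not isinstance(value, list):
                raise ValueError(f"{name} must be a list of strings")
            for item in value:
                if not isinstance(item, str):
                    raise ValueError(
                        f"{name} must contain only strings, got {type(item).__name__}"
                    )


def validate_labels(labels) -> List[str]:
    """Filter to taxonomy labels, dedupe, and keep the sentinel only when alone."""
    deduped: List[str] = []
    for label in labels:
        if label in TAXONOMY and label not in deduped:
            deduped.append(label)
    # The sentinel is a fallback indicator: drop it when real labels exist.
    if UNCATEGORIZED in deduped and len(deduped) > 1:
        deduped = [label for label in deduped if label != UNCATEGORIZED]
    return deduped
```

With this shape, `validate_labels(["uncategorized", "cryptography"])` yields only the real label, while a lone `uncategorized` is preserved.
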

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: ca9bd7d2-9738-4eb1-9bf6-5c1b32faf0db

📥 Commits

Reviewing files that changed from the base of the PR and between e853cd3 and f91b995.

📒 Files selected for processing (2)
  • application/tests/test_cheatsheet_categorizer.py
  • application/utils/external_project_parsers/parsers/cheatsheet_categorizer.py

@shreeshtripurwarcomp23-coder shreeshtripurwarcomp23-coder marked this pull request as draft June 15, 2026 05:06
@shreeshtripurwarcomp23-coder shreeshtripurwarcomp23-coder marked this pull request as ready for review June 15, 2026 05:10
@shreeshtripurwarcomp23-coder shreeshtripurwarcomp23-coder marked this pull request as draft June 15, 2026 05:11