GSoC 2026 Module B — Week 2: Stage 1 regex filter + Stage 1.5 sanitize#928
manshusainishab wants to merge 12 commits into
…contract

Establishes the data contract Module B consumes from Module A. ChangeRecord is a Pydantic v2 model matching A's actual emission shape: nested source (discriminated union on type for github/rss), span (chunk position + heading_path + char/line offsets), and locator (addressing scheme). Internal models ClassifyResult and QueuePayload prep for later stages.

hashing.py provides normalize_text + compute_content_hash since Module A does not emit content_hash; B computes its own (SHA-256 of normalized text) for use as the knowledge_queue dedup key.

22 unittest cases cover the round-trip, the discriminated union, hash determinism, normalization rules, code-fence preservation, and idempotency. Full make test: 271 passing, no regressions.

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
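A minimal sketch of the hashing idea described in this commit, not the PR's actual `hashing.py`. The commit only states that the hash is SHA-256 over normalized text; the specific normalization steps below (NFC plus LF line endings) are assumptions chosen so that whitespace inside code fences is left alone.

```python
import hashlib
import unicodedata


def normalize_text(text: str) -> str:
    """Assumed normalization rules: Unicode NFC and LF-only line endings.

    Interior whitespace is left untouched so code-fence layout survives.
    """
    text = unicodedata.normalize("NFC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")


def compute_content_hash(text: str) -> str:
    """Return the SHA-256 hex digest of the normalized text (dedup key)."""
    return hashlib.sha256(normalize_text(text).encode("utf-8")).hexdigest()
```

Because the normalization is idempotent, hashing a record twice, or hashing text that differs only in line-ending style or Unicode composition, yields the same key.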
…ifact

module_a_mock.jsonl: Module A's canonical 20-record mock shared 2026-05-29, saved as JSONL (one record per line per the contract). Becomes a permanent integration-test fixture for B's parser and a reference shape for the Module A contributor.

module_a_contract.schema.json: JSON Schema generated from B's Pydantic ChangeRecord model via model_json_schema(). 246 lines covering ChangeRecord and its four nested types (GithubSource, RssSource, Span, Locator). Source of truth for cross-module CI validation.

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
build_labeled_dataset.py: PyGithub-based harvester that acts as Module A's stand-in for producing benchmark data. Fetches recent commits from 4 OWASP repos (WSTG, ASVS, CheatSheetSeries, SAMM), applies the contract's normalization rules, splits into chunks at markdown heading boundaries with a fence-aware stack-based walker that tracks heading_path + char/line offsets, and emits records in Module A's actual nested shape. Pluggable via GITHUB_TOKEN env var. Reproducible: python scripts/build_labeled_dataset.py regenerates the candidate set.

label_dataset.py: resumable interactive TUI for manual classification. Atomic-writes labeled_data.json after every keystroke; lookup by chunk_id for resume. Embeds the recall-first definition (agreed with maintainer 2026-06-01) so labelers see the rule front-of-mind: KNOWLEDGE for any chunk with security signal, NOISE only for pure organizational content.

candidate_commits.json: 100 records, 25 per repo, all Pydantic-valid against ChangeRecord. 90/100 have non-empty heading_path; 10 multi-chunk artifacts captured.

labeled_data.json: 100/100 labeled by hand under the recall-first rule. Distribution 55 KNOWLEDGE / 40 NOISE / 5 UNCERTAIN. Per-repo skew is visible: CheatSheetSeries 92% K, SAMM 0% K (the SAMM commits sampled landed entirely on Website/Sponsorship/meetings paths -- empirical input for Week 2's noise_patterns.yaml).

Part of GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B.
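The fence-aware, stack-based heading walker mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the harvester's code: the function name `chunk_by_headings` and the output dict keys are hypothetical, only line offsets are tracked (character offsets are omitted for brevity), and only ``` fences are recognized.

```python
import re

# ATX heading: 1-6 '#' characters, whitespace, then the title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split markdown at headings, ignoring heading-like lines inside fences."""
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []  # (heading level, title) of open ancestors
    in_fence = False
    current_lines: list[str] = []
    current_path: list[str] = []
    start_line = 1

    def flush(end_line: int) -> None:
        body = "\n".join(current_lines)
        if body.strip():
            chunks.append(
                {
                    "heading_path": list(current_path),
                    "text": body,
                    "start_line": start_line,
                    "end_line": end_line,
                }
            )

    for lineno, line in enumerate(markdown.split("\n"), start=1):
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # toggle on every fence delimiter
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush(lineno - 1)
            level = len(match.group(1))
            # Pop siblings and deeper headings before pushing the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
            current_path = [title for _, title in stack]
            current_lines = [line]
            start_line = lineno
        else:
            current_lines.append(line)
    flush(lineno)
    return chunks
```

The stack gives each chunk its full ancestry (for example `["A", "B"]` for a `## B` section under `# A`), and the `in_fence` flag keeps a `# comment` line inside a code sample from starting a new chunk.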
Super-Linter (Black 24.4.2) flagged 4 files in the previous push. Applied `black` (same pinned version) to bring them in line with the repo's formatting standard. Cosmetic changes only: blank lines around section-separator comments, one multi-line dict join. No behavior or test changes -- `make test` remains 271 passing, 1 skip.
- Sort __all__ lists in hashing.py and schemas.py to satisfy Ruff RUF022.
- Declare JSON Schema dialect ($schema = draft 2020-12, which is what Pydantic v2 model_json_schema() emits) on the contract artifact.
- Wrap load_labeled() in scripts/label_dataset.py with try/except so a corrupted labeled_data.json prints an actionable hint instead of a raw JSONDecodeError stack trace.

Deferred to Week 2 (will be addressed when we touch the harvester):

- chunker should also track <pre> open/close, not just ``` fences
- _split_chunk_by_size cursor arithmetic assumes \n\n separator even on hard-split sub-chunks

Tests: 271 passing, 1 skip (unchanged). Black: clean.
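To illustrate the second deferred item, here is a rough sketch of offset bookkeeping that records a separator width per entry (2 after a normal `\n\n` paragraph boundary, 0 after a hard-split fragment), which is the approach the Week 2 description names. The function `split_with_offsets` and its splitting policy are hypothetical; the real `_split_chunk_by_size` is not shown in this PR page.

```python
def split_with_offsets(text: str, max_chars: int) -> list[tuple[int, int, str]]:
    """Split text into pieces and return (start, end, piece) spans.

    Paragraphs are separated by a blank line; any paragraph longer than
    max_chars is hard-split into contiguous fragments.
    """
    entries: list[tuple[str, int]] = []  # (piece, width of separator after it)
    paragraphs = text.split("\n\n")
    for p_idx, para in enumerate(paragraphs):
        fragments = [
            para[i : i + max_chars] for i in range(0, len(para), max_chars)
        ] or [""]
        for f_idx, frag in enumerate(fragments):
            last_fragment = f_idx == len(fragments) - 1
            last_paragraph = p_idx == len(paragraphs) - 1
            # Only a real paragraph boundary consumes the two newline chars.
            sep = 2 if last_fragment and not last_paragraph else 0
            entries.append((frag, sep))

    spans: list[tuple[int, int, str]] = []
    cursor = 0
    for piece, sep in entries:
        start, end = cursor, cursor + len(piece)
        spans.append((start, end, piece))
        cursor = end + sep
    return spans
```

With an unconditional `+2` between pieces, every fragment after a hard split would be reported two characters too late; with per-entry widths, `text[start:end]` reproduces each piece exactly.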
Defensive text cleanup (PDF ligatures, zero-width chars, HTML, hyphenation). Vendored from rocklambros/TRACT under CC0; drops their whitespace-collapse step so structure (newlines, paragraphs) is preserved for Module B's LLM. 26 unit tests, all passing.
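A hedged sketch of what such a cleanup pipeline might look like. It is not the vendored TRACT code: the ligature table, the zero-width character set, and the deliberately crude regex-based tag stripping are all assumptions made for illustration. The one property taken from the description is that no whitespace-collapse step runs, so newlines and paragraph breaks survive.

```python
import html
import re
import unicodedata

# Assumed subset of PDF ligatures; the real table may be larger.
_LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}
_ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\ufeff]")
_TAG_RE = re.compile(r"<[^>]+>")  # crude: also removes any other <...> run
_HYPHEN_BREAK_RE = re.compile(r"(\w)-\n(\w)")


def strip_html(text: str) -> str:
    """Drop tags, then decode entities such as &amp;."""
    return html.unescape(_TAG_RE.sub("", text))


def sanitize_text(text: str) -> str:
    """Defensive cleanup that keeps newlines and paragraph structure."""
    for ligature, plain in _LIGATURES.items():
        text = text.replace(ligature, plain)
    text = _ZERO_WIDTH_RE.sub("", text)
    text = strip_html(text)
    text = _HYPHEN_BREAK_RE.sub(r"\1\2", text)  # re-join words split at line end
    return unicodedata.normalize("NFC", text)
```

Note the absence of a `re.sub(r"\s+", " ", text)` step: a two-paragraph input comes back with its blank line intact.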
Path-based filter with extension/filename/glob deny rules and allow_overrides. Patterns are deliberately conservative under the recall-first labeling rule. 15 unit tests including >=90% rejection / 0% false-positive acceptance criteria.
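The rule structure described here could be sketched as below. The rule keys mirror the attribute names mentioned in the review (`deny_extensions`, `deny_filenames`, `deny_paths`, `allow_overrides`), and the meetings glob comes from the PR description; every other pattern, the precedence of overrides, the `rec["locator"]["path"]` record shape, and the choice that `filter_records` yields the kept records are assumptions. The rules are an inline dict rather than YAML to keep the sketch dependency-free.

```python
import fnmatch
import posixpath
from typing import Iterable, Iterator

RULES = {
    "deny_extensions": [".png", ".svg"],  # hypothetical
    "deny_filenames": ["CODEOWNERS", "LICENSE"],  # hypothetical
    "deny_paths": ["**/Supporting Resources/meetings/**"],
    "allow_overrides": ["**/security/**"],  # hypothetical
}


def is_noise_path(path: str, rules: dict = RULES) -> bool:
    """Return True when the path matches a deny rule and no allow override."""
    name = posixpath.basename(path)
    ext = posixpath.splitext(name)[1].lower()
    # Assumption: an allow override wins over every deny rule.
    if any(fnmatch.fnmatchcase(path, pat) for pat in rules["allow_overrides"]):
        return False
    if ext in rules["deny_extensions"] or name in rules["deny_filenames"]:
        return True
    return any(fnmatch.fnmatchcase(path, pat) for pat in rules["deny_paths"])


def filter_records(records: Iterable[dict], rules: dict = RULES) -> Iterator[dict]:
    """Lazily yield records whose locator path is not classified as noise."""
    for rec in records:
        if not is_noise_path(rec["locator"]["path"], rules):
            yield rec
```

One design point worth knowing: in `fnmatch`, `*` already matches `/`, so `**` is not a special recursive wildcard the way it is in gitignore-style matchers. A pattern such as `**/x/**` therefore needs at least one directory component before `x`.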
Walkthrough

This PR introduces Module B, a noise/relevance filtering system for OpenCRE's scraper pipeline. It defines data contracts (ChangeRecord, Source union), implements text normalization and sanitization, provides path-based filtering via YAML rules, delivers comprehensive test coverage with fixtures, and includes tooling to harvest GitHub documentation and interactively label training datasets.

Changes

Module B: Noise Filtering Pipeline
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 5
🧹 Nitpick comments (2)
application/utils/noise_filter/regex_filter.py (1)

40-60: 💤 Low value. Class name uses "Regex" but implementation uses `fnmatch` glob patterns.

The class is named `RegexFilter` but the implementation relies on `fnmatch` for glob-style matching (not regular expressions). The behavior is correctly documented in both the module docstring and `noise_patterns.yaml`, so this is purely a naming inconsistency. Consider renaming to `PatternFilter` or `GlobFilter` in a future refactor if it causes confusion.

scripts/build_labeled_dataset.py (1)

520-528: 💤 Low value. Consider logging exception type for better debuggability.

The bare `Exception` catch is pragmatic for this non-critical rate limit display, but logging the exception class name would help diagnose unexpected failures without changing the control flow.

🔧 Proposed improvement

     except Exception as e:
    -    print(f"(could not read rate limit: {e})")
    +    print(f"(could not read rate limit: {type(e).__name__}: {e})")
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@application/tests/noise_filter/sanitize_test.py`:
- Line 35: The test contains raw U+200B characters in string literals (e.g., the
variable text in sanitize_test.py) which trigger Ruff PLE2515; replace each raw
zero-width-space character with an explicit Unicode escape (use \u200B) in those
string literals (including the other occurrence around line 105) so the runtime
value is unchanged but the source contains no literal U+200B characters, then
run make lint to verify the file passes.
In `@application/utils/noise_filter/sanitize.py`:
- Line 56: Replace the invisible literal zero‑width characters in the
_ZERO_WIDTH_RE regex with explicit Unicode escape sequences so the pattern is
readable and maintainable; update the _ZERO_WIDTH_RE definition (the re.compile
call for _ZERO_WIDTH_RE in sanitize.py) to use a raw string containing the
specific escapes (e.g. \u200B, \u200C, \u200D, \uFEFF or other needed zero‑width
codepoints) inside the character class, preserving the re.Pattern[str]
annotation and behavior.
In `@application/utils/noise_filter/schemas.py`:
- Around line 84-97: Locator currently requires a path for every kind which
breaks non-path schemes; change Locator into a discriminated union using the
existing discriminator field "kind" (keep the base class Locator as a BaseModel
with model_config and common fields like kind and id but make path
optional/absent there), then add concrete subclasses such as RepoPathLocator
(kind="repo_path") that require path: str and id: str, and other scheme-specific
classes like FeedPostLocator with their own fields; update any type
annotations/usages referring to Locator (e.g., RegexFilter.is_noise_record) to
accept the union type and regenerate
docs/gsoc_2026_module_b/module_a_contract.schema.json so the public schema
reflects the discriminated models.
- Around line 72-78: The Span model currently allows inconsistent values (e.g.,
index == total or end < start); add validation to the model (use Pydantic
`@root_validator` or field `@validator` in the Span class) to enforce: index < total
(0 <= index and index must be strictly less than total), if either
start_char_idx or end_char_idx is set then both must be set and end_char_idx >=
start_char_idx, and similarly for start_line/end_line (both present or both None
and end_line >= start_line). Implement these checks in a single root validator
(e.g., validate_span) that raises ValueError with a clear message when
invariants fail and reference the fields index, total, start_char_idx,
end_char_idx, start_line, end_line.
- Around line 111-117: The model-level setting ChangeRecord.model_config
currently sets str_strip_whitespace=True which trims all str fields (including
text) during validation; remove or disable str_strip_whitespace from the
model_config and instead apply trimming only to specific fields if needed (e.g.,
add a field-specific validator or use a constrained field for any fields that
must be trimmed), ensuring ChangeRecord.text remains unmodified during
validation so offsets/spans stay consistent; update model_config and add a
targeted trim implementation referencing ChangeRecord, model_config, and the
text Field.
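The Span invariants listed in the comments above can be expressed as one plain function, sketched here without any Pydantic dependency. The function name `check_span_invariants` is hypothetical. In Pydantic v2, `@root_validator` and `@validator` are deprecated in favor of `@model_validator` and `@field_validator`, so in the real model a check like this would more likely be called from a `@model_validator(mode="after")` method.

```python
from typing import Optional


def check_span_invariants(
    *,
    index: int,
    total: int,
    start_char_idx: Optional[int] = None,
    end_char_idx: Optional[int] = None,
    start_line: Optional[int] = None,
    end_line: Optional[int] = None,
) -> None:
    """Raise ValueError when the span fields contradict each other."""
    if not 0 <= index < total:
        raise ValueError(f"index must satisfy 0 <= index < total, got {index}/{total}")
    pairs = (
        ("start_char_idx", start_char_idx, "end_char_idx", end_char_idx),
        ("start_line", start_line, "end_line", end_line),
    )
    for start_name, start, end_name, end in pairs:
        # Both offsets of a pair must be present together or absent together.
        if (start is None) != (end is None):
            raise ValueError(f"{start_name} and {end_name} must both be set or both be None")
        if start is not None and end < start:
            raise ValueError(f"{end_name} ({end}) must be >= {start_name} ({start})")
```

Keeping the checks in a standalone function makes them easy to unit-test independently of the model wiring.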
---
Nitpick comments:
In `@application/utils/noise_filter/regex_filter.py`:
- Around line 40-60: The class name RegexFilter is misleading because the
implementation uses glob-style matching (fnmatch); rename the class to a clearer
name (e.g., PatternFilter or GlobFilter) and update all references/imports and
tests accordingly: change the class declaration (RegexFilter -> PatternFilter),
update instantiation sites and any type annotations that reference RegexFilter
(including functions, tests, and other modules that import it), and ensure
exported names (if any) and the attribute patterns_path, deny_extensions,
deny_filenames, deny_paths, and allow_overrides remain unchanged so behavior and
YAML wiring continue to work.
In `@scripts/build_labeled_dataset.py`:
- Around line 520-528: The except block that catches Exception when reading the
GitHub rate limit (in the try/except around gh.get_rate_limit(), rl, core, and
the print of rate limits) should include the exception class name in the log
message for better debuggability; update the except clause to format the message
with both the exception type (e.g., using e.__class__.__name__ or
type(e).__name__) and the exception message so the printed line shows the error
class and details without changing control flow.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
Run ID: 9c395721-0b73-4b11-940a-981425b83662
📒 Files selected for processing (16)
- application/tests/noise_filter/__init__.py
- application/tests/noise_filter/fixtures/candidate_commits.json
- application/tests/noise_filter/fixtures/labeled_data.json
- application/tests/noise_filter/fixtures/module_a_mock.jsonl
- application/tests/noise_filter/regex_filter_test.py
- application/tests/noise_filter/sanitize_test.py
- application/tests/noise_filter/schemas_test.py
- application/utils/noise_filter/__init__.py
- application/utils/noise_filter/hashing.py
- application/utils/noise_filter/noise_patterns.yaml
- application/utils/noise_filter/regex_filter.py
- application/utils/noise_filter/sanitize.py
- application/utils/noise_filter/schemas.py
- docs/gsoc_2026_module_b/module_a_contract.schema.json
- scripts/build_labeled_dataset.py
- scripts/label_dataset.py
Summary
… `<pre>` tracking and hard-split cursor arithmetic).

Builds on Week 1 (`module_b_w1`, in review). Once Week 1 lands, this PR will rebase against `main` automatically.

Part of the GSoC 2026 OpenCRE Scraper & Indexer (Project OIE) Module B (Noise/Relevance Filter).
What this PR adds
Stage 1.5: text sanitization (commit `77ad228`)

- `application/utils/noise_filter/sanitize.py`: `sanitize_text(text)` + `strip_html(text)`. Defensive cleanup: PDF ligatures (ﬀ → ff), zero-width chars, HTML entities/tags, broken hyphenation, NFC normalization. Idempotent on clean Module A output.
- `application/tests/noise_filter/sanitize_test.py`

Vendored from rocklambros/TRACT (CC0 1.0 Universal), with one deliberate deviation: we do NOT collapse interior whitespace. TRACT's pipeline ends with `re.sub(r"\s+", " ", text)`, which flattens newlines too. That is right for embedding-similarity use cases but destroys structure that Module A's contract explicitly preserves (rule 6: whitespace inside code fences). Module B's LLM benefits from paragraph breaks and code-fence layout, so we keep them intact. Also dropped TRACT's `max_length`/`return_full` machinery and `sanitize_control(dict)` helper; both belong in their own domain, not B's.

Stage 1: path-based regex filter (commit `8f5169b`)

- `application/utils/noise_filter/regex_filter.py`: `RegexFilter` class. Public surface: `is_noise_path(path)`, `is_noise_record(rec)`, `filter_records(records)` lazy generator.
- `application/utils/noise_filter/noise_patterns.yaml`
- `application/tests/noise_filter/regex_filter_test.py`

Conservative by design under the recall-first labeling rule (agreed with the maintainer on 2026-06-01, see the Week 1 PR thread). Stage 1 only blocks paths where we're highly confident no security content lives there. We deliberately do NOT block `**/blog/**`, `Website/content/**`, or `**/sponsors.md` even though Week 1's labeling found them ~100% organizational: recall-first says let the Stage 2 LLM judge content rather than blocking at the path level. We DO block `**/Supporting Resources/meetings/**` and `**/Supporting Resources/enterprise metrics/**` because Week 1 found those 100% organizational across all 25 SAMM samples.

CodeRabbit Week 1 deferrals (commit `806b0e5`)

Addresses two CodeRabbit comments on the Week 1 PR that were deferred to Week 2 to land alongside other harvester work:

- … `<pre>` block depth in addition to ``` fences, so a heading-like line inside `<pre>...</pre>` no longer triggers a false split.
- `_split_chunk_by_size` cursor arithmetic now records per-entry separator width (0 for hard-split fragments, 2 for normal `\n\n` boundaries) instead of unconditionally adding `+2` between every consecutive sub-chunk.

Empirical check on the existing Week 1 dataset: 0/100 chunks contain `<pre>` tags and 0/100 are near the 4000-char hard-split ceiling. So the existing `labeled_data.json` and `candidate_commits.json` are unchanged: both old and new chunker produce bit-identical output on our sample. The fixes are forward-looking, protecting future harvests from corner-case files we didn't happen to sample.

Test plan

- `make test`: 312 passing, 1 skip, 0 failures, 0 errors (was 271 after Week 1; +41 from Week 2: 15 regex_filter + 26 sanitize).
- `black --check .`: clean across the whole repo (164 files unchanged, 0 to reformat). Pinned 24.4.2, matching Super-Linter.
- … `<pre>` tracking prevents false splits and hard-split sub-chunks are contiguous (no phantom +2).

Notes for reviewers

- `PromptHandler` (LiteLLM-backed): Week 3 deliverable.
- `KnowledgeQueueItem` SQLAlchemy model + Alembic migration: Week 5 deliverable.
- … `module_b_w1` so that `ChangeRecord` from Week 1 is available. PR targets `main` directly; GitHub will rebase once Week 1 lands.

Note: Claude is used for some parts of the PR and file changes.