chore: add AI agent knowledge bank [DATA-32873]#18
chore: add AI agent knowledge bank [DATA-32873]#18quocnguyendinh wants to merge 3 commits intomasterfrom
Conversation
Add CLAUDE.md and knowledge/ folder documenting architecture, design patterns, conventions, and component layers to enable AI coding agents to follow existing patterns when working on the codebase. Refs: DATA-32873 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move knowledge files from knowledge/ to .claude/rules/ so they are automatically loaded into every Claude Code session. Replace root CLAUDE.md with .claude/CLAUDE.md. Refs: DATA-32873 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds an AI-agent-oriented documentation “knowledge bank” under .claude/, intended to be auto-loaded in Claude Code sessions to capture Diffa’s architecture, conventions, and operational patterns in one place.
Changes:
- Added
.claude/CLAUDE.mdas the entrypoint and multiple.claude/rules/*.mdreference docs. - Documented key layers (CLI → managers → services → repositories), configuration resolution, DB access patterns, and testing patterns.
- Captured project conventions (naming, imports, error-handling, exit codes, dependencies).
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
.claude/CLAUDE.md |
Claude entrypoint doc referencing auto-loaded rules content. |
.claude/rules/architecture.md |
High-level architecture, layers, and end-to-end data-diff flow. |
.claude/rules/cli.md |
CLI surface area and command/option documentation. |
.claude/rules/configuration.md |
ConfigManager/DBConfig hierarchy and config precedence. |
.claude/rules/conventions.md |
Naming/import/type-hint/logging/error-handling conventions. |
.claude/rules/database-layer.md |
DB connection strategies, ORM/schema notes, query/upsert/locking patterns. |
.claude/rules/design-patterns.md |
Repository/service/manager and other design patterns used in the codebase. |
.claude/rules/testing.md |
Pytest setup, fixtures, mocking patterns, and what is/isn’t tested. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| ```python | ||
| # src/diffa/db/diffa_check.py | ||
| class DiffaCheckService: | ||
| def __init__(self, config_manager: ConfigManager): |
There was a problem hiding this comment.
In the DiffaCheckService example, self.config_manager is referenced but never assigned (and differs from the actual implementation which sets self.config_manager = config_manager). Update the snippet so it matches the real constructor wiring and doesn’t imply an undefined attribute.
| def __init__(self, config_manager: ConfigManager): | |
| def __init__(self, config_manager: ConfigManager): | |
| self.config_manager = config_manager |
| ```python | ||
| # src/diffa/managers/check_manager.py | ||
| class CheckManager: | ||
| def __init__(self, config_manager: ConfigManager): |
There was a problem hiding this comment.
In the CheckManager example, SourceTargetService(self.cm) / DiffaCheckService(self.cm) references self.cm but the snippet doesn’t show self.cm = config_manager (as in the real code). Adjust the snippet to avoid an undefined attribute and reflect the actual initialization order.
| def __init__(self, config_manager: ConfigManager): | |
| def __init__(self, config_manager: ConfigManager): | |
| self.cm = config_manager |
| self.conn = DiffaConnection(self.db_config.get_db_config()) | ||
|
|
||
| def get_latest_check(self, source_database, ...) -> Optional[dict]: ... | ||
| def get_invalid_checks(self, source_database, ...) -> List[DiffaCheckSchema]: ... |
There was a problem hiding this comment.
The Repository Pattern snippet says get_invalid_checks(...) -> List[DiffaCheckSchema], but the current implementation is a generator (it yields items) and returns serialized dicts via model_dump(). Please align the documented return type/behavior with the code to avoid misleading consumers of this doc.
| def get_invalid_checks(self, source_database, ...) -> List[DiffaCheckSchema]: ... | |
| def get_invalid_checks(self, source_database, ...) -> Iterator[dict]: ... |
| # Diffa | ||
|
|
||
| A CLI tool that compares row counts between a Source database and a Target database (data warehouse), per day, optionally by dimension columns. Results are persisted in a Diffa state database. | ||
|
|
||
| All architecture, design patterns, conventions, and component documentation is auto-loaded from `.claude/rules/`. |
There was a problem hiding this comment.
PR description mentions adding a root-level CLAUDE.md and a knowledge/ folder, but the change adds .claude/CLAUDE.md and .claude/rules/ instead. Either adjust the description or add/move the files so the implementation matches what reviewers should expect.
| ```python | ||
| def get_last_check_date(self) -> date: ... | ||
| def get_invalid_checks(self, ...) -> List[DiffaCheckSchema]: ... | ||
| def get_counts(self, last_check_date: date, invalid_check_dates: Iterable[date]) -> Iterable[CountCheck]: ... | ||
| ``` |
There was a problem hiding this comment.
In the Type Hints examples, SourceTargetService.get_counts(...) is shown as returning a single Iterable[CountCheck], but the actual API is consumed as a 2-tuple (source_counts, target_counts) (see CheckManager.compare_tables). Consider updating the example signature to reflect the tuple return type so readers don’t copy an incorrect contract.
| ### DiffaCheck | ||
|
|
||
| ``` | ||
| Table: diffa.diffa_checks | ||
| Columns: id (UUID PK), source_database, source_schema, source_table, | ||
| target_database, target_schema, target_table, check_date, | ||
| source_count, target_count, is_valid, diff_count, updated_at | ||
| ``` | ||
|
|
||
| ### DiffaCheckRun | ||
|
|
||
| ``` | ||
| Table: diffa.diffa_check_runs | ||
| Columns: run_id (UUID PK), source_database, source_schema, source_table, | ||
| target_database, target_schema, target_table, status, updated_at | ||
| Unique index: (source_*, target_*) WHERE status = 'RUNNING' | ||
| ``` |
There was a problem hiding this comment.
The “Table: diffa.diffa_checks / diffa.diffa_check_runs” column lists omit created_at, but Alembic migrations create both created_at and updated_at columns for these tables (e.g., src/diffa/migrations/versions/2f7730d11b26... and 9710c39cc078...). Either include created_at in the documented schema or explicitly state that created_at exists in the DB but is not mapped in data_models.py ORM classes.
Each rule file now has a changelog table at the bottom tracking date, PR, and description of changes for decision traceability. Refs: DATA-32873 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
CLAUDE.mdat project root (auto-loaded by Claude Code every session) with essential architecture, patterns, and conventions inlineknowledge/folder with detailed reference docs covering architecture, design patterns, conventions, database layer, configuration, testing, and CLIJira
DATA-32873
Test plan
CLAUDE.mdis loaded automatically in new Claude Code sessionsknowledge/files are accurate against current codebase🤖 Generated with Claude Code