You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The current SQLiteStorageProvider uses raw sqlite3 with dynamic SQL generation, making it tightly coupled to SQLite. Rewriting it with SQLAlchemy would allow using PostgreSQL, MySQL, and other databases for crawl data storage — useful for larger deployments and integration with existing infrastructure.
Silent error logging on constraint violations (OperationalError, IntegrityError, etc.)
No batching — one INSERT per record
No indexes defined in schema.sql
15 tables defined in openwpm/storage/schema.sql
Schema files that must stay in sync:
openwpm/storage/schema.sql (SQLite DDL)
openwpm/storage/parquet_schema.py (Arrow/Parquet)
test/storage/test_values.py (test data)
docs/Schema-Documentation.md
Requirements
Must Have
SQLAlchemy ORM models matching the existing 15-table SQLite schema exactly
SQLAlchemyStorageProvider(StructuredStorageProvider) using SQLAlchemy Core or ORM
Support SQLAlchemy connection URLs (equivalent to JDBC URLs), e.g.:
sqlite:///path/to/crawl.db
postgresql://user:pass@localhost/openwpm
Schema compatibility: the SQLite schema produced must be identical to the current schema.sql (column names, types, constraints, foreign keys) unless there is strong reasoning to diverge
Tests against local PostgreSQL (can use pytest-postgresql or Docker fixture)
Existing parametrized storage tests pass with the new provider
Backward compatibility: SQLiteStorageProvider API surface unchanged for existing users
Nice to Have
Batch inserts for better write throughput
Connection pooling via SQLAlchemy engine
Index definitions for common query patterns
Async support via SQLAlchemy 2.0+ async engine
Out of Scope
Migrating Arrow/Parquet providers to SQLAlchemy
Changing the UnstructuredStorageProvider interface
Schema migrations tooling (Alembic) — can be a follow-up
Suggested implementation approach:
Use an adversarial two-agent workflow: one agent builds the implementation, a second agent reviews to ensure the SQLite schema remains exactly as-is unless strong reasoning is provided for changes.
Motivation
The current
SQLiteStorageProvideruses rawsqlite3with dynamic SQL generation, making it tightly coupled to SQLite. Rewriting it with SQLAlchemy would allow using PostgreSQL, MySQL, and other databases for crawl data storage — useful for larger deployments and integration with existing infrastructure.Current State
SQLiteStorageProvider(openwpm/storage/sql_provider.py):sqlite3module, no ORMINSERT INTO table (col1, ...) VALUES (?, ...)generation per recordbytes→str,callable→str,dict→json.dumpsschema.sqlopenwpm/storage/schema.sqlSchema files that must stay in sync:
openwpm/storage/schema.sql(SQLite DDL)openwpm/storage/parquet_schema.py(Arrow/Parquet)test/storage/test_values.py(test data)docs/Schema-Documentation.mdRequirements
Must Have
SQLAlchemyStorageProvider(StructuredStorageProvider)using SQLAlchemy Core or ORMsqlite:///path/to/crawl.dbpostgresql://user:pass@localhost/openwpmschema.sql(column names, types, constraints, foreign keys) unless there is strong reasoning to divergepytest-postgresqlor Docker fixture)SQLiteStorageProviderAPI surface unchanged for existing usersNice to Have
Out of Scope
UnstructuredStorageProviderinterfaceImplementation Notes
Interface contract — the provider must implement:
Connection URL approach:
SQLAlchemy uses database URLs which are the Python equivalent of JDBC URLs:
Suggested implementation approach:
Use an adversarial two-agent workflow: one agent builds the implementation, a second agent reviews to ensure the SQLite schema remains exactly as-is unless strong reasoning is provided for changes.
Related
SQLiteStorageProvider,LocalArrowProvider,MemoryStructuredProvider