Skip to content

Rewrite SQLiteStorageProvider to use SQLAlchemy #1143

@vringar

Description

@vringar

Motivation

The current SQLiteStorageProvider uses raw sqlite3 with dynamic SQL generation, making it tightly coupled to SQLite. Rewriting it with SQLAlchemy would allow using PostgreSQL, MySQL, and other databases for crawl data storage — useful for larger deployments and integration with existing infrastructure.

Current State

SQLiteStorageProvider (openwpm/storage/sql_provider.py):

  • Raw sqlite3 module, no ORM
  • Dynamic INSERT INTO table (col1, ...) VALUES (?, ...) generation per record
  • Lossy type coercion: bytesstr, callablestr, dictjson.dumps
  • Silent error logging on constraint violations (OperationalError, IntegrityError, etc.)
  • No batching — one INSERT per record
  • No indexes defined in schema.sql
  • 15 tables defined in openwpm/storage/schema.sql

Schema files that must stay in sync:

  • openwpm/storage/schema.sql (SQLite DDL)
  • openwpm/storage/parquet_schema.py (Arrow/Parquet)
  • test/storage/test_values.py (test data)
  • docs/Schema-Documentation.md

Requirements

Must Have

  • SQLAlchemy ORM models matching the existing 15-table SQLite schema exactly
  • SQLAlchemyStorageProvider(StructuredStorageProvider) using SQLAlchemy Core or ORM
  • Support SQLAlchemy connection URLs (equivalent to JDBC URLs), e.g.:
    • sqlite:///path/to/crawl.db
    • postgresql://user:pass@localhost/openwpm
  • Schema compatibility: the SQLite schema produced must be identical to the current schema.sql (column names, types, constraints, foreign keys) unless there is strong reasoning to diverge
  • Tests against local PostgreSQL (can use pytest-postgresql or Docker fixture)
  • Existing parametrized storage tests pass with the new provider
  • Backward compatibility: SQLiteStorageProvider API surface unchanged for existing users

Nice to Have

  • Batch inserts for better write throughput
  • Connection pooling via SQLAlchemy engine
  • Index definitions for common query patterns
  • Async support via SQLAlchemy 2.0+ async engine

Out of Scope

  • Migrating Arrow/Parquet providers to SQLAlchemy
  • Changing the UnstructuredStorageProvider interface
  • Schema migrations tooling (Alembic) — can be a follow-up

Implementation Notes

Interface contract — the provider must implement:

class SQLAlchemyStorageProvider(StructuredStorageProvider):
    async def init(self) -> None: ...
    async def store_record(self, table: TableName, visit_id: VisitId, record: Dict[str, Any]) -> None: ...
    async def finalize_visit_id(self, visit_id: VisitId, interrupted: bool) -> Optional[Task]: ...
    async def flush_cache(self) -> None: ...
    async def shutdown(self) -> None: ...

Connection URL approach:
SQLAlchemy uses database URLs which are the Python equivalent of JDBC URLs:

dialect+driver://username:password@host:port/database

Suggested implementation approach:
Use an adversarial two-agent workflow: one agent builds the implementation, a second agent reviews to ensure the SQLite schema remains exactly as-is unless strong reasoning is provided for changes.

Related

  • Rewrite DataAggregators #561 — Rewrite DataAggregators (closed, partially addressed by current storage system)
  • Current providers: SQLiteStorageProvider, LocalArrowProvider, MemoryStructuredProvider

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions