Eye imaging dataset discovery pipeline. Discovers eye imaging datasets across 7 scientific data repositories (Zenodo, Figshare, Dryad, OSF, DataCite, Kaggle, NEI), inspects ZIP/TAR contents via HTTP Range requests, and classifies records using envision-classifier.
Part of the EyeACT project by the FAIR Data Innovations Hub.
This project uses mise to manage tool versions (Python 3.12, uv) and uv for dependency management.
Install mise by following the mise installation guide.
```bash
# Install Python 3.12 and uv (as specified in mise.toml)
mise install

# Install project dependencies
uv pip install -r requirements.txt
```

Requirements: Python >= 3.10, envision-classifier (installed automatically).
```bash
# Full pipeline: scrape all repos -> classify -> post to portal
./automation.sh

# Run steps independently
./automation.sh scrape    # Scrape only (saves to data/metadata/{source}/)
./automation.sh classify  # Classify existing data (loads from data/metadata/)
./automation.sh post      # Post results to portal only
```

```bash
# Scrape and classify all 7 repositories
python -m envision

# Single repository
python -m envision --source dryad

# Scrape only (saves per-record JSON, no classification)
python -m envision --scrape-only

# Classify existing data (loads from data/metadata/, no scraping)
python -m envision --skip-scrape --results-dir ./results

# Classify + download files for records labelled EYE_IMAGING
python -m envision --source zenodo --skip-scrape --download

# Download across all sources at a tighter confidence gate
python -m envision --source all --download --download-threshold 0.95
```

Files are only downloaded after a record is classified as EYE_IMAGING with confidence ≥ `--download-threshold` (default 0.80). They land in `data/downloads/{source}/{source_id}/` alongside a per-record `manifest.json`. See `docs/downloading.md`.
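The download gate described above amounts to a two-field check on each classified record. A minimal sketch, assuming the record fields documented later in this README (`should_download` is a hypothetical helper name, not the downloader's actual API):

```python
def should_download(record: dict, threshold: float = 0.80) -> bool:
    """True only for EYE_IMAGING records at or above the confidence gate."""
    return record["label"] == "EYE_IMAGING" and record["confidence"] >= threshold
```

Raising the threshold with `--download-threshold` simply tightens this comparison; records that fail the gate are still classified and written to `results/`, just never fetched.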
That's a separate tool — envision-eye-actionable. It reads this repo's `data/downloads/{source}/{id}/` layout by default and produces HuggingFace-loadable ADDF v0.1.0 trees at `data/actionable/{source}/{id}/`.
```bash
pip install envision-eye-actionable
envision-conform --source zenodo                      # conforms every downloaded record
envision-conform --source zenodo --source-id 4521044  # one record
```

All 7 scrapers follow the same pattern:
- Save per-record JSON to `data/metadata/{source}/` as they scrape
- Resume automatically on restart (skip already-scraped records)
- Proactive rate limiting (delay before every API call) + unlimited exponential backoff retries
- Shared search terms across 47 ophthalmology-specific queries
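The rate-limiting and retry discipline in the bullets above can be sketched as follows. This is an illustrative sketch only — `polite_call`, `request_fn`, and the parameters are made-up names, not the actual API in `envision/utils.py`:

```python
import time

def polite_call(request_fn, delay: float, base: float = 1.0, max_wait: float = 300.0):
    """Proactive rate limiting plus unlimited exponential-backoff retries.

    Sleeps BEFORE every call (proactive, so even successful runs stay under
    the limit), and on failure retries forever with exponentially growing,
    capped waits.
    """
    attempt = 0
    while True:
        time.sleep(delay)  # proactive delay before every API call
        try:
            return request_fn()
        except Exception:
            time.sleep(min(max_wait, base * 2 ** attempt))
            attempt += 1
```

Because the retries are unbounded, a scraper left running will eventually ride out any transient outage rather than losing its place in the query list.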
| Source | API | Rate Limit | Archive Inspection | Notes |
|---|---|---|---|---|
| Zenodo | REST + Elasticsearch | 2s/req | ZIP, TAR | AND-required queries, date-range pagination |
| Figshare | REST (POST) | 1s (0.3s with token) | ZIP, TAR | Set FIGSHARE_ACCESS_TOKEN in .env |
| DataCite | REST | 1s/req | N/A (metadata only) | Indexes DOIs across repositories |
| Kaggle | REST | 1s/req | ZIP, TAR | Requires KAGGLE_API_TOKEN |
| Dryad | REST | 1.5s/req | ZIP, TAR | Small corpus |
| NEI | NIH RePORTER (POST) | 1.5s/req | N/A (grants) | Eye-specific by definition |
| OSF | REST v2 search | 2s/req | ZIP, TAR | Set OSF_TOKEN in .env for higher limits |
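The archive-inspection column above relies on HTTP Range requests: only the ZIP end-of-central-directory record and central directory are fetched, never the whole file. A minimal sketch of that technique, with the transport abstracted into a `fetch` callable so it stays testable — names and structure are illustrative, not the real inspector in `envision/utils.py`:

```python
import struct

def list_zip_entries(fetch, tail_bytes: int = 65_557) -> list[str]:
    """List filenames in a remote ZIP without downloading it.

    `fetch(start=None, end=None, suffix=None)` must return the bytes an HTTP
    Range request would ("bytes=start-end" or "bytes=-suffix").
    """
    # The end-of-central-directory (EOCD) record sits in the last
    # 22..65557 bytes of the file, so one suffix-range request finds it.
    tail = fetch(suffix=tail_bytes)
    eocd = tail.rfind(b"PK\x05\x06")
    if eocd == -1:
        raise ValueError("EOCD record not found; not a ZIP?")
    cd_size, cd_offset = struct.unpack_from("<II", tail, eocd + 12)

    # Second ranged request: just the central directory.
    cd = fetch(start=cd_offset, end=cd_offset + cd_size - 1)
    names, pos = [], 0
    while pos + 46 <= len(cd) and cd[pos:pos + 4] == b"PK\x01\x02":
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", cd, pos + 28)
        names.append(cd[pos + 46:pos + 46 + name_len].decode("utf-8", "replace"))
        pos += 46 + name_len + extra_len + comment_len
    return names
```

Over HTTP, `fetch` would issue a request with a `Range: bytes=...` header; TAR has no central index, so inspecting it takes a different (sequential-header) strategy.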
Each scraper saves per-record JSON to `data/metadata/{source}/{source_id}.json`.
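The resume behavior follows directly from this one-file-per-record layout: a record is "done" exactly when its JSON file exists. A sketch under that assumption (`save_record` is a hypothetical name; the real save/load/resume logic lives in `envision/metadata.py`):

```python
import json
from pathlib import Path

def save_record(base: Path, source: str, source_id: str, record: dict) -> bool:
    """Write {base}/{source}/{source_id}.json, skipping existing files.

    Returns False when the record was already scraped in an earlier run,
    which is what lets a restarted scraper resume where it left off.
    """
    path = base / source / f"{source_id}.json"
    if path.exists():
        return False  # already scraped in a previous run
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(record, indent=2))
    return True
```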
Classification results go to `results/`:

| File | Description |
|---|---|
| `{source}_eye_imaging.json` | Records classified as EYE_IMAGING, sorted by confidence |
| `{source}_all_results.json` | All classified records with binary labels |
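Producing those two files from a list of classified records is a small filter-and-sort step. A sketch assuming the record fields shown below (`write_results` is a hypothetical helper, not the pipeline's actual API):

```python
import json
from pathlib import Path

def write_results(records: list[dict], source: str, out_dir: str = "results") -> None:
    """Write {source}_all_results.json and {source}_eye_imaging.json."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # All classified records, as-is.
    (out / f"{source}_all_results.json").write_text(json.dumps(records, indent=2))
    # Positive records only, highest confidence first.
    eye = sorted(
        (r for r in records if r["label"] == "EYE_IMAGING"),
        key=lambda r: r["confidence"],
        reverse=True,
    )
    (out / f"{source}_eye_imaging.json").write_text(json.dumps(eye, indent=2))
```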
Each classified record:

```json
{
  "source": "zenodo",
  "source_id": "8254022",
  "doi": "10.5281/zenodo.8254022",
  "url": "https://zenodo.org/records/8254022",
  "label": "EYE_IMAGING",
  "confidence": 0.9998,
  "prob_eye_imaging": 0.9998,
  "prob_negative": 0.0002,
  "title": "Dataset for PT-OCT ANN Project",
  "description": "...",
  "keywords": ["PT-OCT", "ANN"],
  "access_type": "open",
  "license": "cc-by-4.0",
  "file_types": [".zip"],
  "file_names": ["Data.zip"],
  "file_count": 1,
  "img_count": 0,
  "medical_count": 0,
  "archive_count": 1,
  "genomics_count": 0,
  "size_mb": 302.1,
  "external_links": [],
  "related_dois": []
}
```

| Label | Description |
|---|---|
| EYE_IMAGING | Actual eye imaging datasets (fundus, OCT, OCTA, cornea, slit-lamp, anterior segment) |
| NEGATIVE | Everything else (non-eye data, software/code, eye-adjacent non-imaging, non-eye medical imaging) |
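The probability fields in the example record are consistent with the binary label being an argmax over two complementary probabilities. A hedged sketch of that relationship — this is an assumption about how `pipeline.py` derives the fields, not its confirmed logic:

```python
def finalize_label(prob_eye_imaging: float) -> dict:
    """Derive label/confidence fields from the positive-class probability.

    ASSUMPTION: the two probabilities sum to 1 and `confidence` is the
    probability of the predicted label; the real pipeline may differ.
    """
    prob_negative = 1.0 - prob_eye_imaging
    label = "EYE_IMAGING" if prob_eye_imaging >= prob_negative else "NEGATIVE"
    return {
        "label": label,
        "confidence": max(prob_eye_imaging, prob_negative),
        "prob_eye_imaging": prob_eye_imaging,
        "prob_negative": prob_negative,
    }
```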
```
envision-discovery/
├── envision/
│   ├── __init__.py      # Re-exports EyeImagingClassifier from envision-classifier
│   ├── __main__.py      # python -m envision entry point
│   ├── cli.py           # CLI (--source, --skip-scrape, --scrape-only, --download)
│   ├── scraper.py       # Zenodo scraper with ZIP inspection + AND queries
│   ├── pipeline.py      # Classification pipeline (downloads model from HuggingFace)
│   ├── downloader.py    # Post-classification file downloader (gated by EYE_IMAGING)
│   ├── addf_export.py   # ADDF metadata export (dataset_description + structure_description from the scrape)
│   ├── metadata.py      # DatasetMetadata dataclass (save/load/resume)
│   ├── utils.py         # Shared utilities (backoff, archive inspector, pagination)
│   └── scrapers/        # Per-source scrapers (all save to data/metadata/{source}/)
│       ├── datacite.py
│       ├── figshare.py
│       ├── kaggle.py
│       ├── dryad.py
│       ├── nei.py
│       └── osf.py
├── automation.sh        # Weekly cron — scrape/classify/post (run steps independently)
├── data/
│   └── metadata/        # Per-record JSON files per source (not committed)
│       ├── zenodo/
│       ├── figshare/
│       ├── datacite/
│       ├── kaggle/
│       ├── dryad/
│       ├── nei/
│       └── osf/
├── results/             # Classification output (not committed)
├── .env.example         # API tokens template (Figshare, OSF, Kaggle, Portal)
├── pyproject.toml
└── README.md
```
- envision-eye-actionable — turns this repo's downloads into HuggingFace-loadable ADDF v0.1.0 trees
- envision-classifier — the SetFit classifier package (`pip install envision-classifier`)
- Model weights on HuggingFace
MIT License. Individual dataset licenses vary — check each dataset before use.