ENVISION Discovery

Discovery pipeline for eye imaging datasets. It searches 7 scientific data repositories (Zenodo, Figshare, Dryad, OSF, DataCite, Kaggle, NEI), inspects ZIP/TAR contents via HTTP Range requests, and classifies records using envision-classifier.

Part of the EyeACT project by the FAIR Data Innovations Hub.

Quick Start

This project uses mise to manage tool versions (Python 3.12, uv) and uv for dependency management.

Prerequisites

Install mise by following the mise installation guide.

Install dependencies

# Install Python 3.12 and uv (as specified in mise.toml)
mise install

# Install project dependencies
uv pip install -r requirements.txt

Requirements: Python >= 3.10, envision-classifier (installed automatically)

Usage

Automated pipeline

# Full pipeline: scrape all repos -> classify -> post to portal
./automation.sh

# Run steps independently
./automation.sh scrape      # Scrape only (saves to data/metadata/{source}/)
./automation.sh classify    # Classify existing data (loads from data/metadata/)
./automation.sh post        # Post results to portal only

CLI

# Scrape and classify all 7 repositories
python -m envision

# Single repository
python -m envision --source dryad

# Scrape only (saves per-record JSON, no classification)
python -m envision --scrape-only

# Classify existing data (loads from data/metadata/, no scraping)
python -m envision --skip-scrape --results-dir ./results

# Classify + download files for records labelled EYE_IMAGING
python -m envision --source zenodo --skip-scrape --download

# Download across all sources at a tighter confidence gate
python -m envision --source all --download --download-threshold 0.95

Files are only downloaded after a record is classified as EYE_IMAGING with confidence ≥ --download-threshold (default 0.80). They land in data/downloads/{source}/{source_id}/ alongside a per-record manifest.json. See docs/downloading.md.
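
The gate described above can be sketched as follows. This is a minimal illustration, not the project's downloader: the function names `should_download` and `download_dir` are hypothetical, and only the record fields (`label`, `confidence`, `source`, `source_id`), the 0.80 default, and the directory layout are taken from this README.

```python
from pathlib import Path

DEFAULT_THRESHOLD = 0.80  # default of --download-threshold, per this README


def should_download(record: dict, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Return True only for EYE_IMAGING records at or above the threshold."""
    return (
        record.get("label") == "EYE_IMAGING"
        and float(record.get("confidence", 0.0)) >= threshold
    )


def download_dir(record: dict, root: Path = Path("data/downloads")) -> Path:
    """Build data/downloads/{source}/{source_id}/ for a record (hypothetical helper)."""
    return root / record["source"] / str(record["source_id"])


if __name__ == "__main__":
    record = {
        "source": "zenodo",
        "source_id": "8254022",
        "label": "EYE_IMAGING",
        "confidence": 0.9998,
    }
    # Mirrors --download-threshold 0.95 from the CLI examples.
    if should_download(record, threshold=0.95):
        print(download_dir(record) / "manifest.json")
```

Raising the threshold trades recall for precision: fewer records are downloaded, but a larger share of them should be genuine eye imaging datasets.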

Turning downloads into training-ready trees

This step is handled by a separate tool, envision-eye-actionable. By default it reads this repo's data/downloads/{source}/{id}/ layout and produces HuggingFace-loadable ADDF v0.1.0 trees at data/actionable/{source}/{id}/.

pip install envision-eye-actionable
envision-conform --source zenodo                  # conforms every downloaded record
envision-conform --source zenodo --source-id 4521044   # one record

Scrapers

All 7 scrapers follow the same pattern:

Save per-record JSON to data/metadata/{source}/ as they scrape
Resume automatically on restart (skip already-scraped records)
Apply proactive rate limiting (a delay before every API call) plus exponential backoff with an unlimited number of retries
Use a shared set of 47 ophthalmology-specific search queries
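
The rate-limiting and retry behaviour in the list above can be sketched like this. It is an illustration under stated assumptions, not the code in utils.py: the name `call_with_backoff`, its parameters, the cap on the wait time, and the choice to retry on `OSError` are all hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    request: Callable[[], T],
    delay: float = 1.0,       # proactive pause before every attempt (seconds)
    base: float = 2.0,        # first backoff wait (seconds)
    max_wait: float = 300.0,  # cap on a single wait (assumption, not from the README)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request() until it succeeds, pausing before each attempt."""
    attempt = 0
    while True:  # no retry limit, mirroring the "unlimited" retries described above
        sleep(delay)  # proactive rate limiting
        try:
            return request()
        except OSError:  # network-style failures; urllib's URLError subclasses OSError
            sleep(min(max_wait, base * 2 ** attempt))  # wait doubles after each failure
            attempt += 1
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A per-source delay from the table below would be supplied as `delay`, for example `delay=2.0` for Zenodo.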

Source	API	Rate Limit	Archive Inspection	Notes
Zenodo	REST + Elasticsearch	2s/req	ZIP, TAR	AND-required queries, date-range pagination
Figshare	REST (POST)	1s/req (0.3s/req with token)	ZIP, TAR	Set `FIGSHARE_ACCESS_TOKEN` in `.env`
DataCite	REST	1s/req	N/A (metadata only)	Indexes DOIs across repositories
Kaggle	REST	1s/req	ZIP, TAR	Requires `KAGGLE_API_TOKEN`
Dryad	REST	1.5s/req	ZIP, TAR	Small corpus
NEI	NIH RePORTER (POST)	1.5s/req	N/A (grants)	Eye-specific by definition
OSF	REST v2 search	2s/req	ZIP, TAR	Set `OSF_TOKEN` in `.env` for higher limits
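
For the sources with archive inspection, listing a ZIP's contents over HTTP Range requests can be sketched as below. This is not the project's archive inspector: `RangeReader`, `http_fetcher`, and `list_zip_members` are hypothetical names, the sketch covers ZIP only (TAR has no central directory and needs a different approach), and it assumes the server honours `Range` headers with a 206 response.

```python
import io
import urllib.request
import zipfile
from typing import Callable

Fetch = Callable[[int, int], bytes]  # fetch(start, end_inclusive) -> bytes


class RangeReader:
    """Read-only, seekable file object that pulls bytes on demand via fetch()."""

    def __init__(self, size: int, fetch: Fetch) -> None:
        self.size = size
        self.fetch = fetch
        self.pos = 0
        self.bytes_fetched = 0  # how much of the archive was actually transferred

    def seekable(self) -> bool:
        return True

    def readable(self) -> bool:
        return True

    def tell(self) -> int:
        return self.pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self.pos, io.SEEK_END: self.size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")  # same behaviour as a real file
        self.pos = base + offset
        return self.pos

    def read(self, n: int = -1) -> bytes:
        if n is None or n < 0:
            n = self.size - self.pos
        end = min(self.pos + n, self.size)
        if end <= self.pos:
            return b""
        data = self.fetch(self.pos, end - 1)
        self.pos += len(data)
        self.bytes_fetched += len(data)
        return data


def http_fetcher(url: str) -> Fetch:
    """Return a fetch() that issues one HTTP Range request per call."""
    def fetch(start: int, end: int) -> bytes:
        req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
        with urllib.request.urlopen(req, timeout=30) as resp:
            if resp.status != 206:  # server ignored the Range header
                raise OSError(f"expected 206 Partial Content, got {resp.status}")
            return resp.read()
    return fetch


def list_zip_members(size: int, fetch: Fetch) -> list[str]:
    """List member names by reading only the end record and central directory."""
    with zipfile.ZipFile(RangeReader(size, fetch)) as archive:
        return archive.namelist()
```

`zipfile` only seeks to the end of the file and reads the central directory when listing members, so a multi-gigabyte archive can typically be inspected with a few small requests. The archive size would come from the repository's file metadata or a `Content-Length` header; for a real URL the call would look like `list_zip_members(size, http_fetcher(url))`.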

Output

Each scraper saves per-record JSON to data/metadata/{source}/{source_id}.json.
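
Because every record has its own file, resuming can be reduced to an existence check. The sketch below illustrates that idea and is not the DatasetMetadata implementation in metadata.py: `record_path` and `save_if_new` are hypothetical helpers, and the temporary-file rename is an added assumption to avoid half-written files.

```python
import json
from pathlib import Path


def record_path(source: str, source_id: str, root: Path = Path("data/metadata")) -> Path:
    """Return data/metadata/{source}/{source_id}.json."""
    return root / source / f"{source_id}.json"


def save_if_new(record: dict, root: Path = Path("data/metadata")) -> bool:
    """Write the record unless its JSON file already exists; return True if written."""
    path = record_path(record["source"], str(record["source_id"]), root)
    if path.exists():
        return False  # already scraped on an earlier run, so skip it
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(record, indent=2), encoding="utf-8")
    tmp.replace(path)  # rename last, so an interrupted run leaves no partial record
    return True
```

A scraper loop built on this would call `save_if_new` for each result and only inspect archives or fetch extra metadata when it returns True.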

Classification results go to results/:

File	Description
`{source}_eye_imaging.json`	Records classified as EYE_IMAGING, sorted by confidence
`{source}_all_results.json`	All classified records with binary labels

Each classified record:

{
  "source": "zenodo",
  "source_id": "8254022",
  "doi": "10.5281/zenodo.8254022",
  "url": "https://zenodo.org/records/8254022",
  "label": "EYE_IMAGING",
  "confidence": 0.9998,
  "prob_eye_imaging": 0.9998,
  "prob_negative": 0.0002,
  "title": "Dataset for PT-OCT ANN Project",
  "description": "...",
  "keywords": ["PT-OCT", "ANN"],
  "access_type": "open",
  "license": "cc-by-4.0",
  "file_types": [".zip"],
  "file_names": ["Data.zip"],
  "file_count": 1,
  "img_count": 0,
  "medical_count": 0,
  "archive_count": 1,
  "genomics_count": 0,
  "size_mb": 302.1,
  "external_links": [],
  "related_dois": []
}
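
A results file can be consumed with a few lines of standard-library Python. The sketch below assumes each results file is a JSON array of records shaped like the example above; that layout, and the helper names `load_records` and `select`, are assumptions rather than documented behaviour.

```python
import json
from pathlib import Path


def load_records(path: Path) -> list[dict]:
    """Load a results file, assumed here to be a JSON array of record objects."""
    records = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{path} does not contain a JSON array")
    return records


def select(
    records: list[dict],
    min_confidence: float = 0.9,
    access_type: str = "open",
) -> list[dict]:
    """Keep EYE_IMAGING records with the given access type, highest confidence first."""
    kept = [
        r for r in records
        if r.get("label") == "EYE_IMAGING"
        and r.get("access_type") == access_type
        and r.get("confidence", 0.0) >= min_confidence
    ]
    return sorted(kept, key=lambda r: r["confidence"], reverse=True)


if __name__ == "__main__":
    results_file = Path("results/zenodo_eye_imaging.json")
    if results_file.exists():
        for rec in select(load_records(results_file)):
            print(f'{rec["confidence"]:.4f}  {rec["doi"]}  {rec["title"]}')
```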

Classification labels

Label	Description
EYE_IMAGING	Actual eye imaging datasets (fundus, OCT, OCTA, cornea, slit-lamp, anterior segment)
NEGATIVE	Everything else (non-eye data, software/code, eye-adjacent non-imaging, non-eye medical imaging)

Repository structure

envision-discovery/
├── envision/
│   ├── __init__.py         # Re-exports EyeImagingClassifier from envision-classifier
│   ├── __main__.py         # python -m envision entry point
│   ├── cli.py              # CLI (--source, --skip-scrape, --scrape-only, --download)
│   ├── scraper.py          # Zenodo scraper with ZIP inspection + AND queries
│   ├── pipeline.py         # Classification pipeline (downloads model from HuggingFace)
│   ├── downloader.py       # Post-classification file downloader (gated by EYE_IMAGING)
│   ├── addf_export.py      # ADDF metadata export (dataset_description + structure_description from the scrape)
│   ├── metadata.py         # DatasetMetadata dataclass (save/load/resume)
│   ├── utils.py            # Shared utilities (backoff, archive inspector, pagination)
│   └── scrapers/           # Per-source scrapers (all save to data/metadata/{source}/)
│       ├── datacite.py
│       ├── figshare.py
│       ├── kaggle.py
│       ├── dryad.py
│       ├── nei.py
│       └── osf.py
├── automation.sh           # Weekly cron — scrape/classify/post (run steps independently)
├── data/
│   └── metadata/           # Per-record JSON files per source (not committed)
│       ├── zenodo/
│       ├── figshare/
│       ├── datacite/
│       ├── kaggle/
│       ├── dryad/
│       ├── nei/
│       └── osf/
├── results/                # Classification output (not committed)
├── .env.example            # API tokens template (Figshare, OSF, Kaggle, Portal)
├── pyproject.toml
└── README.md

Related repositories

envision-eye-actionable — turns this repo's downloads into HuggingFace-loadable ADDF v0.1.0 trees
envision-classifier — The SetFit classifier package (pip install envision-classifier)
Model weights on HuggingFace

License

MIT License. Individual dataset licenses vary — check each dataset before use.
