Docker image for spaCy POS tagging, lemmatization and dependency parsing with support for input and output in CoNLL-U format.
This is a slim, focused implementation extracted from sota-pos-lemmatizers (originally developed by José Angel Daza). It follows the same pattern as conllu-treetagger-docker.
- Multi-language support: Works with any spaCy model for 70+ languages
- CoNLL-U input/output: Reads and writes CoNLL-U format
- On-demand model fetching: Models are downloaded on first run and cached in /local/models
- GermaLemma integration: Enhanced lemmatization for German (optional, German models only)
- Morphological features: Extracts and formats morphological features in CoNLL-U format
- Dependency parsing: Optional dependency relations (HEAD/DEPREL columns)
- Flexible configuration: Environment variables for batch size, chunk size, timeouts, etc.
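For orientation, CoNLL-U stores one token per line in ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC), with `#` comment lines and a blank line after each sentence. The Python sketch below is not part of this project: `parse_conllu` is a hypothetical helper, and the sample sentence and its annotations are made up for illustration rather than taken from real container output.

```python
# Column names defined by the CoNLL-U format.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# Illustrative sample only; NOT actual output of the container.
SAMPLE = (
    "# text = Das ist gut.\n"
    "1\tDas\tder\tPRON\t_\tCase=Nom|Number=Sing\t3\tnsubj\t_\t_\n"
    "2\tist\tsein\tAUX\t_\tMood=Ind|Tense=Pres\t3\tcop\t_\t_\n"
    "3\tgut\tgut\tADJ\t_\tDegree=Pos\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)


def parse_conllu(text):
    """Return a list of sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        elif line.startswith("#"):
            # Comment lines carry metadata; this sketch skips them.
            continue
        else:
            current.append(dict(zip(FIELDS, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences


sentences = parse_conllu(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["lemma"], token["upos"], token["feats"])
```

The LEMMA, UPOS/XPOS and FEATS columns correspond to the lemmatization, POS tagging and morphological features listed above, and HEAD/DEPREL to the optional dependency relations.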
From Docker Hub
docker pull korap/conllu-spacy
From source
git clone https://github.com/KorAP/conllu-spacy-docker.git
cd conllu-spacy-docker
make
# Default: German model with dependency parsing and GermaLemma
docker run --rm -i korap/conllu-spacy < input.conllu > output.conllu
# Disable dependency parsing for faster processing
docker run --rm -i korap/conllu-spacy -d < input.conllu > output.conllu
# Use a smaller German model
docker run --rm -i korap/conllu-spacy -m de_core_news_sm < input.conllu > output.conllu
# Use French model
docker run --rm -i korap/conllu-spacy -m fr_core_news_lg < input.conllu > output.conllu
# Use English model (disable GermaLemma for non-German)
docker run --rm -i korap/conllu-spacy -m en_core_web_lg -g < input.conllu > output.conllu
To avoid downloading the language model on every run, mount a local directory to /local/models:
chmod 777 /path/to/local/models
docker run --rm -i -v /path/to/local/models:/local/models korap/conllu-spacy < input.conllu > output.conllu
The first run will download the model to /path/to/local/models/, and subsequent runs will reuse it.
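The download-once, reuse-afterwards behaviour can be pictured with a short Python sketch. It is not the project's entrypoint logic: `ensure_model` and `fake_download` are hypothetical names, and the stand-in downloader only creates a directory instead of fetching anything.

```python
import tempfile
from pathlib import Path


def ensure_model(name, cache_dir, download):
    """Return the cached model path, calling `download` only if it is missing."""
    target = Path(cache_dir) / name
    if not target.is_dir():
        download(name, target)
    return target


calls = []


def fake_download(name, target):
    # Stand-in for a real download: just creates the directory.
    target.mkdir(parents=True)
    calls.append(name)


with tempfile.TemporaryDirectory() as cache:
    ensure_model("de_core_news_lg", cache, fake_download)  # first run: downloads
    ensure_model("de_core_news_lg", cache, fake_download)  # second run: cache hit
    download_count = len(calls)

print(download_count)  # 1
```

Without a mounted volume, the cache directory lives inside the container and disappears with `--rm`, which is why every fresh run would otherwise download the model again.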
There are several ways to preload models before running the container:
# Preload the default model (de_core_news_lg)
./preload-models.sh
# Preload a specific model
./preload-models.sh de_core_news_sm
# Preload to a custom directory
./preload-models.sh de_core_news_lg /path/to/models
# Then run with the preloaded models
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu
# Build an image with models pre-installed
docker build -f Dockerfile.with-models -t korap/conllu-spacy:with-models .
# Run without needing to mount volumes
docker run --rm -i korap/conllu-spacy:with-models < input.conllu > output.conllu
Edit Dockerfile.with-models to include additional models (sm, md) by uncommenting the relevant lines.
# Create models directory
mkdir -p ./models
# Download using a temporary container
docker run --rm -v ./models:/models python:3.12-slim bash -c "
pip install -q spacy &&
python -m spacy download de_core_news_lg &&
python -c 'import spacy, shutil, site;
shutil.copytree(site.getsitepackages()[0] + \"/de_core_news_lg\", \"/models/de_core_news_lg\")'
"
# Use the preloaded model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu
korapxmltool, which includes korapxml2conllu as a shortcut, can be downloaded from https://github.com/KorAP/korapxmltool.
korapxml2conllu goe.zip | docker run --rm -i korap/conllu-spacy
korapxmltool -A "docker run --rm -i korap/conllu-spacy" -t zip goe.zip
Usage: docker run --rm -i korap/conllu-spacy [OPTIONS]
Options:
-h Display help message
-m MODEL Specify spaCy model (default: de_core_news_lg)
-L List available/installed models
-V Display spaCy version information
-d Disable dependency parsing (faster processing)
-g Disable GermaLemma (use spaCy lemmatizer only)
To check which versions of conllu-spacy-docker and its components are installed:
docker run --rm korap/conllu-spacy -V
Example output:
=== Version Information ===
conllu-spacy-docker version: 3.8.11-3
spaCy version: 3.8.11
GermaLemma version: 0.1.3
Python version: 3.12.1
You can customize processing behavior with environment variables:
docker run --rm -i \
-e SPACY_USE_DEPENDENCIES="False" \
-e SPACY_USE_GERMALEMMA="True" \
-e SPACY_CHUNK_SIZE="10000" \
-e SPACY_BATCH_SIZE="1000" \
-e SPACY_N_PROCESS="1" \
-e SPACY_PARSE_TIMEOUT="30" \
-e SPACY_MAX_SENTENCE_LENGTH="500" \
korap/conllu-spacy < input.conllu > output.conllu
Available environment variables:
- SPACY_USE_DEPENDENCIES: Enable/disable dependency parsing (default: "True")
- SPACY_USE_GERMALEMMA: Enable/disable GermaLemma (default: "True")
- SPACY_CHUNK_SIZE: Number of sentences to process per chunk (default: 20000)
- SPACY_BATCH_SIZE: Batch size for spaCy processing (default: 2000)
- SPACY_N_PROCESS: Number of processes (default: 10)
- SPACY_PARSE_TIMEOUT: Timeout for dependency parsing per sentence in seconds (default: 30)
- SPACY_MAX_SENTENCE_LENGTH: Maximum sentence length for dependency parsing in tokens (default: 500)
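As a rough picture of how these variables and their documented defaults could be turned into typed settings, here is a minimal Python sketch. `load_settings` is a hypothetical helper, and treating any case variant of "true" as enabled is an assumption; the project's own parsing may differ.

```python
import os

# Defaults as documented above, kept as strings like real environment values.
DEFAULTS = {
    "SPACY_USE_DEPENDENCIES": "True",
    "SPACY_USE_GERMALEMMA": "True",
    "SPACY_CHUNK_SIZE": "20000",
    "SPACY_BATCH_SIZE": "2000",
    "SPACY_N_PROCESS": "10",
    "SPACY_PARSE_TIMEOUT": "30",
    "SPACY_MAX_SENTENCE_LENGTH": "500",
}
BOOLEAN_KEYS = {"SPACY_USE_DEPENDENCIES", "SPACY_USE_GERMALEMMA"}


def load_settings(environ=None):
    """Merge environment values over the defaults and convert their types."""
    environ = os.environ if environ is None else environ
    settings = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(key, default)
        if key in BOOLEAN_KEYS:
            # Assumption: "True"/"False" are compared case-insensitively.
            settings[key] = raw.strip().lower() == "true"
        else:
            settings[key] = int(raw)
    return settings


settings = load_settings({
    "SPACY_USE_DEPENDENCIES": "False",
    "SPACY_BATCH_SIZE": "1000",
})
print(settings["SPACY_USE_DEPENDENCIES"], settings["SPACY_BATCH_SIZE"],
      settings["SPACY_CHUNK_SIZE"])  # False 1000 20000
```

Variables that are not set fall back to the defaults, so only the values that should differ need an `-e` flag.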
# Fast processing: disable dependency parsing
docker run --rm -i korap/conllu-spacy -d < input.conllu > output.conllu
# Use spaCy lemmatizer only (without GermaLemma)
docker run --rm -i korap/conllu-spacy -g < input.conllu > output.conllu
# Smaller model for faster download
docker run --rm -i korap/conllu-spacy -m de_core_news_sm < input.conllu > output.conllu
# Persistent model storage
docker run --rm -i -v ./models:/local/models korap/conllu-spacy < input.conllu > output.conllu
List installed models:
docker run --rm -i korap/conllu-spacy -L
Open a shell within the container:
docker run --rm -it --entrypoint /bin/bash korap/conllu-spacy
Any spaCy model can be specified with the -m option. Models will be downloaded automatically on first use.
spaCy provides trained models for 70+ languages. See spaCy Models for the complete list.
- de_core_news_lg (default, 560MB) - Large model, best accuracy
- de_core_news_md (100MB) - Medium model, balanced
- de_core_news_sm (15MB) - Small model, fastest
# Use French small model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy -m fr_core_news_sm < input.conllu
- fr_core_news_lg (560MB) - Large French model
- fr_core_news_md (100MB) - Medium French model
- fr_core_news_sm (15MB) - Small French model
# Use English model
docker run --rm -i -v ./models:/local/models korap/conllu-spacy -m en_core_web_lg < input.conllu
- en_core_web_lg (560MB) - Large English model
- en_core_web_md (100MB) - Medium English model
- en_core_web_sm (15MB) - Small English model
Note: GermaLemma integration only works with German models. For other languages, pass the -g flag to disable GermaLemma so that the standard spaCy lemmatizer is used.
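When the container is driven from a script, the flag choice can be automated. The Python sketch below assembles the argument list from the documented -m, -d and -g options; `build_command` is a hypothetical helper, and detecting German models by the `de_` name prefix is an assumption based on spaCy's model naming.

```python
import shlex


def build_command(model="de_core_news_lg", dependencies=True,
                  image="korap/conllu-spacy"):
    """Assemble a `docker run` argument list from the documented options."""
    argv = ["docker", "run", "--rm", "-i", image, "-m", model]
    if not dependencies:
        argv.append("-d")  # disable dependency parsing
    if not model.startswith("de_"):
        argv.append("-g")  # GermaLemma only applies to German models
    return argv


command = build_command("en_core_web_lg", dependencies=False)
print(shlex.join(command))
# docker run --rm -i korap/conllu-spacy -m en_core_web_lg -d -g
```

The resulting list can be passed to `subprocess.run` with the CoNLL-U input supplied on standard input, mirroring the `< input.conllu` redirection used in the shell examples.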
From the sota-pos-lemmatizers benchmarks on the TIGER corpus (50,472 sentences):
| Configuration | Lemma Acc | POS Acc | POS F1 | sents/sec |
|---|---|---|---|---|
| spaCy + GermaLemma | 90.98 | 99.07 | 95.84 | 1,230 |
| spaCy (without GermaLemma) | 85.33 | 99.07 | 95.84 | 1,577 |
Note: Disabling dependency parsing (-d flag) significantly improves processing speed while maintaining POS tagging and lemmatization quality.
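The trade-off in the table can be made explicit with a little arithmetic. The Python sketch below uses only the figures reported above; the corpus processing times are rough estimates derived from the sentences-per-second values and ignore startup and model loading.

```python
TIGER_SENTENCES = 50_472

with_germalemma = {"lemma_acc": 90.98, "sents_per_sec": 1230}
without_germalemma = {"lemma_acc": 85.33, "sents_per_sec": 1577}

# Lemma accuracy gained by enabling GermaLemma (difference of the two scores).
lemma_gain = round(
    with_germalemma["lemma_acc"] - without_germalemma["lemma_acc"], 2)

# Relative throughput gained by disabling GermaLemma, in percent.
speedup_pct = round(
    (without_germalemma["sents_per_sec"] / with_germalemma["sents_per_sec"] - 1)
    * 100, 1)

# Estimated seconds to process the whole corpus in each configuration.
seconds_with = round(TIGER_SENTENCES / with_germalemma["sents_per_sec"], 1)
seconds_without = round(TIGER_SENTENCES / without_germalemma["sents_per_sec"], 1)

print(lemma_gain, speedup_pct, seconds_with, seconds_without)
# 5.65 28.2 41.0 32.0
```

In other words, GermaLemma raises the lemma accuracy score by 5.65 at the cost of roughly 28% of the throughput, while POS accuracy and F1 are unchanged.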
The project consists of:
- Dockerfile: Multi-stage build for optimized image size
- docker-entrypoint.sh: Entry point script that handles model fetching and CLI argument parsing
- systems/parse_spacy_pipe.py: Main spaCy processing pipeline
- lib/CoNLL_Annotation.py: CoNLL-U format parsing and token classes
- my_utils/file_utils.py: File handling utilities for chunked processing
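To illustrate what chunked processing of a CoNLL-U stream can look like, here is a minimal Python sketch. It is not the code in my_utils/file_utils.py: `iter_sentence_chunks` is a hypothetical function that groups blank-line-separated sentences into lists of at most `chunk_size` sentences, assuming a positive chunk size.

```python
import io


def iter_sentence_chunks(stream, chunk_size):
    """Yield lists of up to `chunk_size` sentences (each a list of lines)."""
    chunk, sentence = [], []
    for line in stream:
        line = line.rstrip("\n")
        if line.strip():
            # Token and comment lines belong to the current sentence.
            sentence.append(line)
            continue
        # A blank line ends the current sentence, if there is one.
        if sentence:
            chunk.append(sentence)
            sentence = []
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    # Flush whatever is left when the stream ends.
    if sentence:
        chunk.append(sentence)
    if chunk:
        yield chunk


# Five one-token sentences, processed two sentences at a time.
data = "".join(f"1\ttoken{i}\n\n" for i in range(5))
sizes = [len(chunk) for chunk in iter_sentence_chunks(io.StringIO(data), 2)]
print(sizes)  # [2, 2, 1]
```

Reading lazily like this keeps memory use bounded by the chunk size rather than by the size of the input, which is the usual motivation for a setting such as SPACY_CHUNK_SIZE.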
Based on the sota-pos-lemmatizers evaluation project, originally by José Angel Daza and Marc Kupietz, with contributions by Rebecca Wilm. It follows the pattern established by conllu-treetagger-docker.
- spaCy: https://spacy.io/
- GermaLemma: https://github.com/WZBSocialScienceCenter/germalemma
This project's source code is licensed under the BSD 2-Clause License.
See, however, the licenses of the individual components:
- spaCy: MIT License
- GermaLemma: Apache 2.0 License