Bioinformatic pipelines produce a lot of output files, but consuming them downstream is harder than it should be:
- Format variety: tools write TSV, CSV and various other proprietary formats, often mixed within the same pipeline
- Non-standard structure: files may be transposed, headerless, or embed section labels alongside data, requiring custom parsing logic for each tool
- Messy column names: raw names are frequently uppercase, space-separated, dot-delimited, or otherwise non-standard; joining across tools requires manual renaming
- Schema drift: column names and file layouts change silently between tool versions, breaking downstream code with no clear signal of what changed
- No run-level provenance: it is hard to tell which output file came from which sample or processing run once files are collected into a shared directory
nemo is an R package that attempts to address these issues by providing a schema-driven parsing and tidying layer that turns raw pipeline outputs into consistently structured, versioned, analysis-ready tables.
Its R6 classes (Tool, Workflow, Config) form the base layer: given
a directory of bioinformatic results, they identify files by
YAML-defined schemas, reshape and rename columns to a consistent tidy
form, and write to a specified format (Apache Parquet, TSV, CSV, RDS, or
PostgreSQL). Each run also produces a metadata.parquet file alongside
the tidy tables, capturing IDs, paths, and package versions.
Downstream packages extend these base classes by supplying tool-specific schemas and parsers. tidywigits and tidydragen are example R packages that target the large number of outputs from the established bioinformatic pipelines WiGiTS/hmftools and Illumina DRAGEN, respectively.
- Installation: https://tidywf.github.io/nemo/articles/installation
- Files supported: https://tidywf.github.io/nemo/articles/schema_table
- Changelog: https://tidywf.github.io/nemo/articles/NEWS
- R6 structure: https://tidywf.github.io/nemo/articles/structure
- UML: https://tidywf.github.io/nemo/articles/uml
- CI/CD: https://tidywf.github.io/nemo/articles/cicd
Raw pipeline outputs often have non-standard layouts. For example, this file stores QC metrics as key-value rows rather than columns:
library(nemo)
path <- system.file("extdata/tool1/latest", package = "nemo")
writeLines(readLines(file.path(path, "sampleA.tool1.table3.tsv")))
#> SampleID sampleA
#> QCStatus Pass
#> TotalReads 10000
#> MappedReads 9500
#> UnmappedReads 500The run() method is able to filter, tidy, and write all tables of
interest in one call (Workflow1 is nemo’s built-in example workflow):
outdir <- file.path(tempdir(), "quickstart")
wf1 <- Workflow1$new(path = path)
wf1$run(
output_dir = outdir,
format = "parquet",
input_id = "run1",
output_id = "out1",
prefix_include = TRUE
)
list.files(outdir, pattern = "\\.parquet$")
#> [1] "metadata.parquet" "sampleA_tool1_table1.parquet" "sampleA_tool1_table2.parquet"
#> [4] "sampleA_tool1_table3.parquet" "sampleA_tool1_table4.parquet" "sampleA_tool1_table5.parquet"
#> [7] "sampleA_tool1_table6.parquet"Read back the tidied table:
arrow::read_parquet(file.path(outdir, "sampleA_tool1_table3.parquet"))
#> # A tibble: 1 × 8
#> input_id input_prefix output_id sample_id qcstatus reads_total reads_map reads_unmap
#> * <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 run1 sampleA out1 sampleA Pass 10000 9500 500Three optional columns can be prepended to every written table to support downstream tracing and joining. All are opt-in and off by default, but highly recommended for any multi-sample or multi-run pipeline:
| Column | Purpose | User-supplied or auto-generated? |
|---|---|---|
input_id |
identifies the sample or input run | user |
output_id |
identifies the tidywigits processing run | user or auto (ULID) |
input_prefix |
filename prefix (e.g. sample name) | auto |
Using {remotes} directly from GitHub:
install.packages("remotes")
remotes::install_github("tidywf/nemo") # latest main commit
remotes::install_github("tidywf/nemo@v0.0.3.9023") # specific versionAlternatively:
- conda package: https://anaconda.org/tidywf/r-nemo
For more details see: https://tidywf.github.io/nemo/articles/installation
A nemo.R command line interface is available for convenience.
- If you’re using the conda package, the
nemo.Rcommand will already be available inside the activated conda environment. - If you’re not using the conda package, you need to export the
nemo/inst/cli/directory to yourPATHin order to usenemo.R.
nemo_cli=$(Rscript -e 'x = system.file("cli", package = "nemo"); cat(x, "\n")' | xargs)
export PATH="${nemo_cli}:${PATH}"$ nemo.R --version
nemo 0.0.3.9023
#-----------------------------------#
$ nemo.R --help
usage: nemo.R [-h] [-v] {tidy,list} ...
Tidy Bioinformatic Workflows
positional arguments:
{tidy,list} sub-command help
tidy Tidy Workflow Outputs
list List Parsable Workflow Outputs
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
'
#-----------------------------------#
$ nemo.R tidy --help
usage: nemo.R tidy [-h] -w WORKFLOW -d IN_DIR [-o OUTPUT_DIR] [-f FORMAT]
[--input_id INPUT_ID] [--output_id OUTPUT_ID | --ulid]
[--dbname DBNAME] [--dbuser DBUSER] [--include INCLUDE]
[--exclude EXCLUDE] [--prefix_include] [-q]
options:
-h, --help show this help message and exit
-w, --workflow WORKFLOW
Workflow name.
-d, --in_dir IN_DIR Input directory.
-o, --output_dir OUTPUT_DIR
Output directory.
-f, --format FORMAT Format of output [def: parquet] (parquet, db, tsv,
csv, rds)
--input_id INPUT_ID Input ID for this run.
--output_id OUTPUT_ID
Output ID for this run.
--ulid Generate a ULID as output ID.
--dbname DBNAME Database name.
--dbuser DBUSER Database user.
--include INCLUDE Include only these files (comma sep tool_parsers).
--exclude EXCLUDE Exclude only these files (comma sep tool_parsers).
--prefix_include Include input prefix column in output tables.
-q, --quiet Shush all the logs.
#-----------------------------------#
$ nemo.R list --help
usage: nemo.R list [-h] -w WORKFLOW -d IN_DIR [-f FORMAT] [-m MAX] [-q]
options:
-h, --help show this help message and exit
-w, --workflow WORKFLOW
Workflow name.
-d, --in_dir IN_DIR Input directory.
-f, --format FORMAT Format of list output [def: pretty] (tsv, pretty)
-m, --max MAX Max rows to show.
-q, --quiet Shush all the logs.
