tidywigits is an R package for parsing and tidying output from the WiGiTS/hmftools suite of genome and transcriptome analysis tools.
The WiGiTS pipeline produces hundreds of files per sample across dozens
of tools, but consuming them downstream is fragile: formats are
inconsistent, column names span a mix of conventions (e.g. snake_case,
camelCase, dot.separated), some sub-tables are embedded in wide
files, and column layouts change between tool versions.
tidywigits addresses this with a schema-driven parsing layer built on
the nemo base R6 classes,
supplying WiGiTS-specific schemas and parsers that turn raw outputs into
consistently structured, versioned, analysis-ready tables that can be
written to a variety of formats such as Apache Parquet, PostgreSQL, TSV,
or RDS. Each run also produces a metadata.parquet file alongside the
tidy tables, capturing IDs, paths, and R package versions.
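To make the column-name problem concrete, here is a minimal base R sketch that maps mixed conventions onto snake_case. The helper `to_snake_case()` is hypothetical and not part of tidywigits, which relies on explicit schemas rather than a mechanical rename like this one. The input names are illustrative only.

```r
# Hypothetical helper (not a tidywigits function): normalise mixed
# column-name conventions to snake_case with base R regular expressions.
to_snake_case <- function(x) {
  # dot.separated -> dot_separated
  x <- gsub(".", "_", x, fixed = TRUE)
  # split an acronym from the following word: QCStatus -> QC_Status
  x <- gsub("([A-Z]+)([A-Z][a-z])", "\\1_\\2", x, perl = TRUE)
  # split a lower-case letter or digit from a following capital: copyNumber -> copy_Number
  x <- gsub("([a-z0-9])([A-Z])", "\\1_\\2", x, perl = TRUE)
  # collapse any repeated underscores, then lower-case everything
  x <- gsub("_+", "_", x)
  tolower(x)
}

raw_names <- c("QCStatus", "copyNumberSegments", "purity.range", "loh_percent")
tidy_names <- to_snake_case(raw_names)
# tidy_names holds: qc_status, copy_number_segments, purity_range, loh_percent
stopifnot(!anyDuplicated(tidy_names))
```

A purely mechanical rename cannot produce domain-specific names such as cn_segments, which is one reason a schema-driven approach is useful.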
- Installation: https://tidywf.github.io/tidywigits/articles/installation
- Quickstart: https://tidywf.github.io/tidywigits/articles/quickstart
- Files supported: https://tidywf.github.io/tidywigits/articles/schema_table
- Structure: https://tidywf.github.io/tidywigits/articles/structure
- Changelog: https://tidywf.github.io/tidywigits/articles/NEWS
- R6: https://tidywf.github.io/nemo/articles/structure
Some WiGiTS output files have non-standard layouts. For a very simple
example, let's look at PURPLE's purple.qc file that stores QC metrics
as key-value rows rather than columns:
indir_ppl <- system.file("extdata/oa/purple", package = "tidywigits")
writeLines(readLines(file.path(indir_ppl, "sample1.purple.qc")))
#> QCStatus PASS
#> Method NORMAL
#> CopyNumberSegments 472
#> UnsupportedCopyNumberSegments 5
#> Purity 1.0000
#> AmberGender FEMALE
#> CobaltGender FEMALE
#> DeletedGenes 0
#> Contamination 0.0
#> GermlineAberrations NONE
#> AmberMeanDepth 79
#> LohPercent 0.0270
#> TincLevel 0.0000
#> ChimerismPercentage 0.0000
We can utilise the Purple class to parse, tidy and write PURPLE files
in one call via its run() method:
outdir_ppl <- file.path(tempdir(), "ppl_out")
# fmt: skip
ppl <- Purple$new(indir_ppl)
ppl$run(
output_dir = outdir_ppl,
format = "parquet",
input_id = "run1"
)
list.files(outdir_ppl, pattern = "\\.parquet$")
#> [1] "metadata_purple.parquet"
#> [2] "sample1_2_purple_cnvgenetsv.parquet"
#> [3] "sample1_2_purple_qc.parquet"
#> [4] "sample1_2_somatic_purple_drivercatalog.parquet"
#> [5] "sample1_germline_purple_drivercatalog.parquet"
#> [6] "sample1_purple_cnvgenetsv.parquet"
#> [7] "sample1_purple_cnvsomtsv.parquet"
#> [8] "sample1_purple_germdeltsv.parquet"
#> [9] "sample1_purple_purityrange.parquet"
#> [10] "sample1_purple_puritytsv.parquet"
#> [11] "sample1_purple_qc.parquet"
#> [12] "sample1_purple_somclonality.parquet"
#> [13] "sample1_purple_somhist.parquet"
#> [14] "version_2_purple_version.parquet"
#> [15] "version_purple_version.parquet"
Now read back the tidied table:
qc_file <- list.files(outdir_ppl, pattern = "sample1_purple_qc.parquet", full.names = TRUE)
arrow::read_parquet(qc_file) |> str()
#> tibble [1 × 15] (S3: tbl_df/tbl/data.frame)
#> $ input_id : chr "run1"
#> $ qc_status : chr "PASS"
#> $ method : chr "NORMAL"
#> $ cn_segments : int 472
#> $ cn_segments_unsupported: int 5
#> $ purity : num 1
#> $ gender_amber : chr "FEMALE"
#> $ gender_cobalt : chr "FEMALE"
#> $ deleted_genes : int 0
#> $ contamination : num 0
#> $ germline_aberrations : chr "NONE"
#> $ mean_depth_amber : num 79
#> $ loh_percent : num 0.027
#> $ tinc_level : num 0
#> $ chimerism_percent : num 0
#> - attr(*, "file_version")= chr "latest"
Files from the full WiGiTS suite can be processed with the convenient
Wigits class. The starting point is a parent directory with WiGiTS
results, and we can again utilise the run() method:
View input files
indir_w <- system.file("extdata/oa", package = "tidywigits")
fs::dir_tree(indir_w, invert = TRUE, glob = "*.dvc")
/home/runner/miniconda3/envs/bump_env/lib/R/library/tidywigits/extdata/oa
├── alignments
│   ├── sample1.duplicate_freq.tsv
│   ├── sample1.md.metrics
│   └── sample1.redux.duplicate_freq.tsv
├── amber
│   ├── sample1.amber.baf.pcf
│   ├── sample1.amber.contamination.tsv
│   ├── sample1.amber.homozygousregion.tsv
│   └── sample1.amber.qc
├── bamtools
│   ├── sample1.bam_metric.coverage.tsv
│   ├── sample1.bam_metric.exon_medians.tsv
│   ├── sample1.bam_metric.flag_counts.tsv
│   ├── sample1.bam_metric.frag_length.tsv
│   ├── sample1.bam_metric.gene_coverage.tsv
│   ├── sample1.bam_metric.partition_stats.tsv
│   ├── sample1.bam_metric.summary.tsv
│   └── sample1.wgsmetrics
├── chord
│   ├── sample1.chord.mutation_contexts.tsv
│   └── sample1.chord.prediction.tsv
├── cider
│   ├── sample1.cider.blastn_match.tsv.gz
│   ├── sample1.cider.locus_stats.tsv
│   └── sample1.cider.vdj.tsv.gz
├── cobalt
│   ├── cobalt.version
│   ├── sample1.cobalt.gc.median.tsv
│   ├── sample1.cobalt.ratio.median.tsv
│   └── sample1.cobalt.ratio.pcf
├── cuppa
│   ├── sample1.cup.data.csv
│   ├── sample1.cuppa.pred_summ.tsv
│   ├── sample1.cuppa.vis_data.tsv
│   ├── sample1.cuppa_data.tsv.gz
│   └── v1.4
│       └── sample1.cup.data.csv
├── esvee
│   ├── sample1.esvee.alignment.tsv
│   ├── sample1.esvee.assembly.tsv
│   ├── sample1.esvee.breakend.tsv
│   ├── sample1.esvee.phased_assembly.tsv
│   ├── sample1.esvee.prep.disc_stats.tsv
│   ├── sample1.esvee.prep.fragment_length.tsv
│   └── sample1.esvee.prep.junction.tsv
├── flagstats
│   └── sample1.flagstat
├── isofox
│   ├── sample1.isf.alt_splice_junc.csv
│   ├── sample1.isf.fusions.csv
│   ├── sample1.isf.gene_collection.csv
│   ├── sample1.isf.gene_data.csv
│   ├── sample1.isf.pass_fusions.csv
│   ├── sample1.isf.retained_intron.csv
│   ├── sample1.isf.summary.csv
│   └── sample1.isf.transcript_data.csv
├── lilac
│   ├── sample1.lilac.candidates.coverage.tsv
│   ├── sample1.lilac.qc.tsv
│   └── sample1.lilac.tsv
├── linx
│   ├── germline_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.germline.breakend.tsv
│   │   ├── sample1.linx.germline.clusters.tsv
│   │   ├── sample1.linx.germline.driver.catalog.tsv
│   │   ├── sample1.linx.germline.links.tsv
│   │   └── sample1.linx.germline.svs.tsv
│   ├── somatic_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.breakend.tsv
│   │   ├── sample1.linx.clusters.tsv
│   │   ├── sample1.linx.driver.catalog.tsv
│   │   ├── sample1.linx.drivers.tsv
│   │   ├── sample1.linx.fusion.tsv
│   │   ├── sample1.linx.links.tsv
│   │   ├── sample1.linx.svs.tsv
│   │   ├── sample1.linx.vis_copy_number.tsv
│   │   ├── sample1.linx.vis_fusion.tsv
│   │   ├── sample1.linx.vis_gene_exon.tsv
│   │   ├── sample1.linx.vis_protein_domain.tsv
│   │   ├── sample1.linx.vis_segments.tsv
│   │   └── sample1.linx.vis_sv_data.tsv
│   └── v1.25
│       ├── germline_annotations
│       │   ├── linx.version
│       │   └── sample1.linx.germline.breakend.tsv
│       └── somatic_annotations
│           ├── linx.version
│           ├── sample1.linx.breakend.tsv
│           ├── sample1.linx.vis_copy_number.tsv
│           ├── sample1.linx.vis_fusion.tsv
│           ├── sample1.linx.vis_gene_exon.tsv
│           ├── sample1.linx.vis_protein_domain.tsv
│           ├── sample1.linx.vis_segments.tsv
│           └── sample1.linx.vis_sv_data.tsv
├── neo
│   ├── sample1.neo.neo_data.tsv
│   └── sample1.neo.neoepitope.tsv
├── peach
│   ├── sample1.peach.events.tsv
│   ├── sample1.peach.gene.events.tsv
│   ├── sample1.peach.haplotypes.all.tsv
│   ├── sample1.peach.haplotypes.best.tsv
│   └── sample1.peach.qc.tsv
├── purple
│   ├── purple.version
│   ├── sample1.purple.cnv.gene.tsv
│   ├── sample1.purple.cnv.somatic.tsv
│   ├── sample1.purple.driver.catalog.germline.tsv
│   ├── sample1.purple.driver.catalog.somatic.tsv
│   ├── sample1.purple.germline.deletion.tsv
│   ├── sample1.purple.purity.range.tsv
│   ├── sample1.purple.purity.tsv
│   ├── sample1.purple.qc
│   ├── sample1.purple.somatic.clonality.tsv
│   ├── sample1.purple.somatic.hist.tsv
│   └── v4.0
│       ├── purple.version
│       ├── sample1.purple.cnv.gene.tsv
│       └── sample1.purple.qc
├── sage
│   ├── germline
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample2.sage.bqr.tsv
│   │   ├── sample2.sage.exon.medians.tsv
│   │   └── sample2.sage.gene.coverage.tsv
│   ├── somatic
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample1.sage.exon.medians.tsv
│   │   ├── sample1.sage.gene.coverage.tsv
│   │   └── sample2.sage.bqr.tsv
│   └── v3.4.4
│       └── sample1.sage.bqr.tsv
├── sigs
│   ├── sample1.sig.allocation.tsv
│   └── sample1.sig.snv_counts.csv
├── teal
│   ├── sample1.teal.breakend.tsv.gz
│   └── sample1.teal.tellength.tsv
├── virusbreakend
│   └── sample1.virusbreakend.vcf.summary.tsv
└── virusinterpreter
    └── sample1.virus.annotated.tsv
We can parse, tidy up, and write the WiGiTS results into e.g. Parquet format or a PostgreSQL database as follows:
- Parquet:
outdir_w <- file.path(tempdir(), "wigits_out_parquet")
w <- Wigits$new(indir_w)
res <- w$run(
output_dir = outdir_w,
format = "parquet",
input_id = "run1",
output_id = "out1",
prefix_include = TRUE
)
res # shows summary of Wigits object
#> #--- Workflow Wigits ---#
#>
#> |var |value |
#> |:-------------|:-------------------------------------------------------------------------|
#> |name |Wigits |
#> |path |/home/runner/miniconda3/envs/bump_env/lib/R/library/tidywigits/extdata/oa |
#> |ntools |19 |
#> |files_total |228 |
#> |files_matched |112 |
#> |tidied |true |
#> |written |true |
list.files(outdir_w, pattern = "\\.parquet$") |> str()
#> chr [1:120] "metadata.parquet" "sample1_2_2_linx_breakends.parquet" ...
- PostgreSQL (adjust dbname/user for your purposes):
w2 <- Wigits$new(indir_w)
dbconn <- DBI::dbConnect(
drv = RPostgres::Postgres(),
dbname = "tidywigits",
user = "me"
)
res <- w2$run(
format = "db",
input_id = "run2",
output_id = "out2",
prefix_include = TRUE,
dbconn = dbconn
)
Note: Support for VCFs is a work in progress.
Three optional columns can be prepended to every written table to support downstream tracing and joining. All are opt-in and off by default, but highly recommended for any multi-sample or multi-run pipeline:
| Column | Purpose | User-supplied or auto-generated? |
|---|---|---|
| input_id | identifies the sample or input run | user |
| output_id | identifies the tidywigits processing run | user or auto (ULID) |
| input_prefix | filename prefix (e.g. sample name) | auto |
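As a rough illustration of why these ID columns help, the base R sketch below joins two small tables on input_id and output_id. The data frames and their values are made up for the example and do not come from a real tidywigits run. Only the column names input_id, output_id, qc_status and purity are taken from the text above.

```r
# Hypothetical tidy tables from two runs; all values are invented for illustration.
qc <- data.frame(
  input_id = c("run1", "run2"),
  output_id = c("out1", "out2"),
  qc_status = c("PASS", "PASS"),
  stringsAsFactors = FALSE
)
purity <- data.frame(
  input_id = c("run1", "run2"),
  output_id = c("out1", "out2"),
  purity = c(1.0, 0.8),
  stringsAsFactors = FALSE
)

# With the ID columns present, tables can be joined on explicit keys
# instead of relying on file names or row order.
joined <- merge(qc, purity, by = c("input_id", "output_id"))
joined
```

Without such columns, the only link between a row and its originating sample or run would be the name of the file it was read from.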
Using {remotes} directly from GitHub:
install.packages("remotes")
remotes::install_github("tidywf/tidywigits") # latest main commit
remotes::install_github("tidywf/tidywigits@v0.0.7.9006") # specific version
Alternatively:
- conda package: https://anaconda.org/tidywf/r-tidywigits
- Docker image: https://github.com/tidywf/tidywigits/pkgs/container/tidywigits
For more details see: https://tidywf.github.io/tidywigits/articles/installation
A tidywigits.R command line interface is available for convenience.
- If you're using the conda package, the tidywigits.R command will already be available inside the activated conda environment.
- If you're not using the conda package, you need to add the tidywigits/inst/cli/ directory to your PATH in order to use tidywigits.R:
tw_cli=$(Rscript -e 'x = system.file("cli", package = "tidywigits"); cat(x, "\n")' | xargs)
export PATH="${tw_cli}:${PATH}"
$ tidywigits.R --version
tidywigits 0.0.7.9006
#-----------------------------------#
$ tidywigits.R --help
usage: tidywigits.R [-h] [-v] {tidy,list} ...
✨ WiGiTS Output Tidying ✨
positional arguments:
{tidy,list} sub-command help
tidy Tidy Workflow Outputs
list List Parsable Workflow Outputs
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
#-----------------------------------#
#------- Tidy ----------------------#
$ tidywigits.R tidy --help
usage: tidywigits.R tidy [-h] -d IN_DIR [-o OUTPUT_DIR] [-f FORMAT]
[--input_id INPUT_ID] [--output_id OUTPUT_ID |
--ulid] [--dbname DBNAME] [--dbuser DBUSER]
[--include INCLUDE] [--exclude EXCLUDE]
[--prefix_include] [-q]
options:
-h, --help show this help message and exit
-d, --in_dir IN_DIR Input directory.
-o, --output_dir OUTPUT_DIR
Output directory.
-f, --format FORMAT Format of output [def: parquet] (parquet, db, tsv,
csv, rds)
--input_id INPUT_ID Input ID for this run.
--output_id OUTPUT_ID
Output ID for this run.
--ulid Generate a ULID as output ID.
--dbname DBNAME Database name.
--dbuser DBUSER Database user.
--include INCLUDE Include only these files (comma sep tool_parsers).
--exclude EXCLUDE Exclude only these files (comma sep tool_parsers).
--prefix_include Include input prefix column in output tables.
-q, --quiet Shush all the logs.
#-----------------------------------#
#------- List ----------------------#
$ tidywigits.R list --help
usage: tidywigits.R list [-h] -d IN_DIR [-f FORMAT] [-m MAX] [-q]
options:
-h, --help show this help message and exit
-d, --in_dir IN_DIR Input directory.
-f, --format FORMAT Format of list output [def: pretty] (tsv, pretty)
-m, --max MAX Max rows to show.
-q, --quiet Shush all the logs.
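For scripted use, the tidy sub-command can also be driven from an R session. The sketch below is one possible way to do that, not an official interface: `tidy_args()` is a hypothetical helper, the input and output paths are placeholders, and only flags shown in the help text above are used. The command runs only if tidywigits.R is found on PATH.

```r
# Hypothetical helper: assemble arguments for `tidywigits.R tidy` using
# only flags that appear in the help output above.
tidy_args <- function(in_dir, output_dir, format = "parquet",
                      input_id = NULL, ulid = FALSE) {
  args <- c("tidy", "-d", in_dir, "-o", output_dir, "-f", format)
  if (!is.null(input_id)) {
    args <- c(args, "--input_id", input_id)
  }
  if (ulid) {
    # --ulid asks the CLI to generate the output ID itself
    args <- c(args, "--ulid")
  }
  args
}

# Placeholder paths; replace with a real WiGiTS results directory.
args <- tidy_args("path/to/wigits_results", "tidy_out", input_id = "run1", ulid = TRUE)

cli <- unname(Sys.which("tidywigits.R"))
if (nzchar(cli)) {
  # shQuote() protects arguments that contain spaces
  status <- system2(cli, shQuote(args))
} else {
  message("tidywigits.R not found on PATH; would run: tidywigits.R ",
          paste(args, collapse = " "))
}
```

Building the argument vector separately from the call keeps it easy to inspect or log before anything is executed.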

