Tidy WiGiTS Outputs

tidywigits

tidywigits is an R package for parsing and tidying output from the WiGiTS/hmftools suite of genome and transcriptome analysis tools.

The WiGiTS pipeline produces hundreds of files per sample across dozens of tools, but consuming them downstream is fragile: formats are inconsistent, column names span a mix of conventions (e.g. snake_case, camelCase, dot.separated), some sub-tables are embedded in wide files, and column layouts change between tool versions.

tidywigits addresses this with a schema-driven parsing layer built on the nemo base R6 classes. It supplies WiGiTS-specific schemas and parsers that turn raw outputs into consistently structured, versioned, analysis-ready tables, which can be written to several formats, including Apache Parquet, PostgreSQL, TSV, and RDS. Each run also produces a metadata.parquet file alongside the tidy tables, capturing IDs, paths, and R package versions.
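To make the idea of schema-driven parsing concrete, here is a minimal base R sketch. It is not the tidywigits or nemo implementation: the `schema` table, the `converters` list and the `apply_schema()` helper are hypothetical names made up for this illustration. Only the raw and tidy column names are taken from the PURPLE QC example in the Quickstart.

```r
# Hypothetical schema: raw column name, tidy column name and target type.
# Illustration only; the real schemas live inside tidywigits.
schema <- data.frame(
  raw = c("QCStatus", "CopyNumberSegments", "AmberGender"),
  tidy = c("qc_status", "cn_segments", "gender_amber"),
  type = c("character", "integer", "character"),
  stringsAsFactors = FALSE
)

# Map each type label to a base R conversion function.
converters <- list(
  character = as.character,
  integer = as.integer,
  numeric = as.numeric
)

# Hypothetical helper: select, convert and rename columns according to a schema.
apply_schema <- function(d, schema) {
  missing_cols <- setdiff(schema$raw, names(d))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  out <- d[, schema$raw, drop = FALSE]
  for (i in seq_len(nrow(schema))) {
    out[[i]] <- converters[[schema$type[i]]](out[[i]])
  }
  names(out) <- schema$tidy
  out
}

# Raw input as it might look straight after reading: everything is character.
raw_qc <- data.frame(
  QCStatus = "PASS",
  CopyNumberSegments = "472",
  AmberGender = "FEMALE",
  stringsAsFactors = FALSE
)
tidy_qc <- apply_schema(raw_qc, schema)
str(tidy_qc)
```

Keeping names and types in a table rather than in parsing code means that a column-layout change between tool versions could, in principle, be handled by adding a schema entry instead of rewriting a parser.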

Documentation

Installation: https://tidywf.github.io/tidywigits/articles/installation
Quickstart: https://tidywf.github.io/tidywigits/articles/quickstart
Files supported: https://tidywf.github.io/tidywigits/articles/schema_table
Structure: https://tidywf.github.io/tidywigits/articles/structure
Changelog: https://tidywf.github.io/tidywigits/articles/NEWS
R6: https://tidywf.github.io/nemo/articles/structure

Quickstart

Single tool

Some WiGiTS output files have non-standard layouts. As a simple example, consider PURPLE’s purple.qc file, which stores QC metrics as key-value rows rather than columns:

indir_ppl <- system.file("extdata/oa/purple", package = "tidywigits")
writeLines(readLines(file.path(indir_ppl, "sample1.purple.qc")))
#> QCStatus PASS
#> Method   NORMAL
#> CopyNumberSegments   472
#> UnsupportedCopyNumberSegments    5
#> Purity   1.0000
#> AmberGender  FEMALE
#> CobaltGender FEMALE
#> DeletedGenes 0
#> Contamination    0.0
#> GermlineAberrations  NONE
#> AmberMeanDepth   79
#> LohPercent   0.0270
#> TincLevel    0.0000
#> ChimerismPercentage  0.0000

We can utilise the Purple class to parse, tidy and write PURPLE files in one call via its run() method:

outdir_ppl <- file.path(tempdir(), "ppl_out")
# fmt: skip
ppl <- Purple$new(indir_ppl)
ppl$run(
  output_dir = outdir_ppl,
  format = "parquet",
  input_id = "run1"
)
list.files(outdir_ppl, pattern = "\\.parquet$")
#>  [1] "metadata_purple.parquet"                       
#>  [2] "sample1_2_purple_cnvgenetsv.parquet"           
#>  [3] "sample1_2_purple_qc.parquet"                   
#>  [4] "sample1_2_somatic_purple_drivercatalog.parquet"
#>  [5] "sample1_germline_purple_drivercatalog.parquet" 
#>  [6] "sample1_purple_cnvgenetsv.parquet"             
#>  [7] "sample1_purple_cnvsomtsv.parquet"              
#>  [8] "sample1_purple_germdeltsv.parquet"             
#>  [9] "sample1_purple_purityrange.parquet"            
#> [10] "sample1_purple_puritytsv.parquet"              
#> [11] "sample1_purple_qc.parquet"                     
#> [12] "sample1_purple_somclonality.parquet"           
#> [13] "sample1_purple_somhist.parquet"                
#> [14] "version_2_purple_version.parquet"              
#> [15] "version_purple_version.parquet"

Now read back the tidied table:

qc_file <- list.files(outdir_ppl, pattern = "^sample1_purple_qc\\.parquet$", full.names = TRUE)
arrow::read_parquet(qc_file) |> str()
#> tibble [1 × 15] (S3: tbl_df/tbl/data.frame)
#>  $ input_id               : chr "run1"
#>  $ qc_status              : chr "PASS"
#>  $ method                 : chr "NORMAL"
#>  $ cn_segments            : int 472
#>  $ cn_segments_unsupported: int 5
#>  $ purity                 : num 1
#>  $ gender_amber           : chr "FEMALE"
#>  $ gender_cobalt          : chr "FEMALE"
#>  $ deleted_genes          : int 0
#>  $ contamination          : num 0
#>  $ germline_aberrations   : chr "NONE"
#>  $ mean_depth_amber       : num 79
#>  $ loh_percent            : num 0.027
#>  $ tinc_level             : num 0
#>  $ chimerism_percent      : num 0
#>  - attr(*, "file_version")= chr "latest"
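The reshaping from key-value rows to a single wide row can be sketched in base R as follows. This is only an illustration, not how the Purple class is implemented. It assumes the file is tab-separated, uses a four-line excerpt of the QC file shown earlier, and keeps the raw key names rather than applying the tidy column names.

```r
# Excerpt of the purple.qc content shown above (assumed to be tab-separated).
qc_lines <- c(
  "QCStatus\tPASS",
  "Method\tNORMAL",
  "CopyNumberSegments\t472",
  "Purity\t1.0000"
)

# Read the rows as two character columns: key and value.
kv <- read.table(
  text = qc_lines,
  sep = "\t",
  quote = "",
  col.names = c("key", "value"),
  colClasses = "character",
  stringsAsFactors = FALSE
)

# Pivot: each key becomes a column, giving a one-row data frame.
wide <- as.data.frame(as.list(setNames(kv$value, kv$key)), stringsAsFactors = FALSE)

# Let R guess column types (integer, numeric or character).
wide[] <- lapply(wide, utils::type.convert, as.is = TRUE)
str(wide)
```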

Full WiGiTS

Files from the full WiGiTS suite can be processed with the convenient Wigits class. The starting point is a parent directory with WiGiTS results, and we can again utilise the run() method:

View input files

indir_w <- system.file("extdata/oa", package = "tidywigits")
fs::dir_tree(indir_w, invert = TRUE, glob = "*.dvc")
/home/runner/miniconda3/envs/bump_env/lib/R/library/tidywigits/extdata/oa
├── alignments
│   ├── sample1.duplicate_freq.tsv
│   ├── sample1.md.metrics
│   └── sample1.redux.duplicate_freq.tsv
├── amber
│   ├── sample1.amber.baf.pcf
│   ├── sample1.amber.contamination.tsv
│   ├── sample1.amber.homozygousregion.tsv
│   └── sample1.amber.qc
├── bamtools
│   ├── sample1.bam_metric.coverage.tsv
│   ├── sample1.bam_metric.exon_medians.tsv
│   ├── sample1.bam_metric.flag_counts.tsv
│   ├── sample1.bam_metric.frag_length.tsv
│   ├── sample1.bam_metric.gene_coverage.tsv
│   ├── sample1.bam_metric.partition_stats.tsv
│   ├── sample1.bam_metric.summary.tsv
│   └── sample1.wgsmetrics
├── chord
│   ├── sample1.chord.mutation_contexts.tsv
│   └── sample1.chord.prediction.tsv
├── cider
│   ├── sample1.cider.blastn_match.tsv.gz
│   ├── sample1.cider.locus_stats.tsv
│   └── sample1.cider.vdj.tsv.gz
├── cobalt
│   ├── cobalt.version
│   ├── sample1.cobalt.gc.median.tsv
│   ├── sample1.cobalt.ratio.median.tsv
│   └── sample1.cobalt.ratio.pcf
├── cuppa
│   ├── sample1.cup.data.csv
│   ├── sample1.cuppa.pred_summ.tsv
│   ├── sample1.cuppa.vis_data.tsv
│   ├── sample1.cuppa_data.tsv.gz
│   └── v1.4
│       └── sample1.cup.data.csv
├── esvee
│   ├── sample1.esvee.alignment.tsv
│   ├── sample1.esvee.assembly.tsv
│   ├── sample1.esvee.breakend.tsv
│   ├── sample1.esvee.phased_assembly.tsv
│   ├── sample1.esvee.prep.disc_stats.tsv
│   ├── sample1.esvee.prep.fragment_length.tsv
│   └── sample1.esvee.prep.junction.tsv
├── flagstats
│   └── sample1.flagstat
├── isofox
│   ├── sample1.isf.alt_splice_junc.csv
│   ├── sample1.isf.fusions.csv
│   ├── sample1.isf.gene_collection.csv
│   ├── sample1.isf.gene_data.csv
│   ├── sample1.isf.pass_fusions.csv
│   ├── sample1.isf.retained_intron.csv
│   ├── sample1.isf.summary.csv
│   └── sample1.isf.transcript_data.csv
├── lilac
│   ├── sample1.lilac.candidates.coverage.tsv
│   ├── sample1.lilac.qc.tsv
│   └── sample1.lilac.tsv
├── linx
│   ├── germline_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.germline.breakend.tsv
│   │   ├── sample1.linx.germline.clusters.tsv
│   │   ├── sample1.linx.germline.driver.catalog.tsv
│   │   ├── sample1.linx.germline.links.tsv
│   │   └── sample1.linx.germline.svs.tsv
│   ├── somatic_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.breakend.tsv
│   │   ├── sample1.linx.clusters.tsv
│   │   ├── sample1.linx.driver.catalog.tsv
│   │   ├── sample1.linx.drivers.tsv
│   │   ├── sample1.linx.fusion.tsv
│   │   ├── sample1.linx.links.tsv
│   │   ├── sample1.linx.svs.tsv
│   │   ├── sample1.linx.vis_copy_number.tsv
│   │   ├── sample1.linx.vis_fusion.tsv
│   │   ├── sample1.linx.vis_gene_exon.tsv
│   │   ├── sample1.linx.vis_protein_domain.tsv
│   │   ├── sample1.linx.vis_segments.tsv
│   │   └── sample1.linx.vis_sv_data.tsv
│   └── v1.25
│       ├── germline_annotations
│       │   ├── linx.version
│       │   └── sample1.linx.germline.breakend.tsv
│       └── somatic_annotations
│           ├── linx.version
│           ├── sample1.linx.breakend.tsv
│           ├── sample1.linx.vis_copy_number.tsv
│           ├── sample1.linx.vis_fusion.tsv
│           ├── sample1.linx.vis_gene_exon.tsv
│           ├── sample1.linx.vis_protein_domain.tsv
│           ├── sample1.linx.vis_segments.tsv
│           └── sample1.linx.vis_sv_data.tsv
├── neo
│   ├── sample1.neo.neo_data.tsv
│   └── sample1.neo.neoepitope.tsv
├── peach
│   ├── sample1.peach.events.tsv
│   ├── sample1.peach.gene.events.tsv
│   ├── sample1.peach.haplotypes.all.tsv
│   ├── sample1.peach.haplotypes.best.tsv
│   └── sample1.peach.qc.tsv
├── purple
│   ├── purple.version
│   ├── sample1.purple.cnv.gene.tsv
│   ├── sample1.purple.cnv.somatic.tsv
│   ├── sample1.purple.driver.catalog.germline.tsv
│   ├── sample1.purple.driver.catalog.somatic.tsv
│   ├── sample1.purple.germline.deletion.tsv
│   ├── sample1.purple.purity.range.tsv
│   ├── sample1.purple.purity.tsv
│   ├── sample1.purple.qc
│   ├── sample1.purple.somatic.clonality.tsv
│   ├── sample1.purple.somatic.hist.tsv
│   └── v4.0
│       ├── purple.version
│       ├── sample1.purple.cnv.gene.tsv
│       └── sample1.purple.qc
├── sage
│   ├── germline
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample2.sage.bqr.tsv
│   │   ├── sample2.sage.exon.medians.tsv
│   │   └── sample2.sage.gene.coverage.tsv
│   ├── somatic
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample1.sage.exon.medians.tsv
│   │   ├── sample1.sage.gene.coverage.tsv
│   │   └── sample2.sage.bqr.tsv
│   └── v3.4.4
│       └── sample1.sage.bqr.tsv
├── sigs
│   ├── sample1.sig.allocation.tsv
│   └── sample1.sig.snv_counts.csv
├── teal
│   ├── sample1.teal.breakend.tsv.gz
│   └── sample1.teal.tellength.tsv
├── virusbreakend
│   └── sample1.virusbreakend.vcf.summary.tsv
└── virusinterpreter
    └── sample1.virus.annotated.tsv

We can parse, tidy up, and write the WiGiTS results into e.g. Parquet format or a PostgreSQL database as follows:

Parquet:

outdir_w <- file.path(tempdir(), "wigits_out_parquet")
w <- Wigits$new(indir_w)
res <- w$run(
  output_dir = outdir_w,
  format = "parquet",
  input_id = "run1",
  output_id = "out1",
  prefix_include = TRUE
)
res # shows summary of Wigits object
#> #--- Workflow Wigits ---#
#> 
#> |var           |value                                                                     |
#> |:-------------|:-------------------------------------------------------------------------|
#> |name          |Wigits                                                                    |
#> |path          |/home/runner/miniconda3/envs/bump_env/lib/R/library/tidywigits/extdata/oa |
#> |ntools        |19                                                                        |
#> |files_total   |228                                                                       |
#> |files_matched |112                                                                       |
#> |tidied        |true                                                                      |
#> |written       |true                                                                      |
list.files(outdir_w, pattern = "\\.parquet$") |> str()
#>  chr [1:120] "metadata.parquet" "sample1_2_2_linx_breakends.parquet" ...

PostgreSQL (adjust dbname/user for your purposes):

w2 <- Wigits$new(indir_w)
dbconn <- DBI::dbConnect(
  drv = RPostgres::Postgres(),
  dbname = "tidywigits",
  user = "me"
)
res <- w2$run(
  format = "db",
  input_id = "run2",
  output_id = "out2",
  prefix_include = TRUE,
  dbconn = dbconn
)

Note: Support for VCFs is a work in progress.

Three optional columns can be prepended to every written table to support downstream tracing and joining. All are opt-in and off by default, but highly recommended for any multi-sample or multi-run pipeline:

Column	Purpose	User-supplied or auto-generated?
`input_id`	identifies the sample or input run	user
`output_id`	identifies the tidywigits processing run	user or auto (ULID)
`input_prefix`	filename prefix (e.g. sample name)	auto
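As a rough sketch of why these columns help, the base R example below stacks tidy QC tables from two runs and joins them to a sample sheet via `input_id`. The `samples` table, its `cohort` column and the purity value for run2 are invented for illustration; only the `input_id`, `qc_status` and `purity` column names come from the output shown earlier.

```r
# Tidy QC tables from two hypothetical runs, each carrying its input_id.
qc_run1 <- data.frame(
  input_id = "run1", qc_status = "PASS", purity = 1.00,
  stringsAsFactors = FALSE
)
qc_run2 <- data.frame(
  input_id = "run2", qc_status = "PASS", purity = 0.62, # invented value
  stringsAsFactors = FALSE
)

# Because every table carries input_id, rows from different runs can be
# stacked without losing track of where they came from.
qc_all <- rbind(qc_run1, qc_run2)

# Hypothetical sample sheet keyed by the same input_id.
samples <- data.frame(
  input_id = c("run1", "run2"),
  cohort = c("A", "B"),
  stringsAsFactors = FALSE
)

# Join QC metrics to the sample sheet.
merged <- merge(qc_all, samples, by = "input_id")
print(merged)
```

Without `input_id`, the stacked rows would be indistinguishable once the per-run file names are gone.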

Installation

Using {remotes} directly from GitHub:

install.packages("remotes")
remotes::install_github("tidywf/tidywigits") # latest main commit
remotes::install_github("tidywf/tidywigits@v0.0.7.9006") # specific version

Alternatively:

conda package: https://anaconda.org/tidywf/r-tidywigits
Docker image: https://github.com/tidywf/tidywigits/pkgs/container/tidywigits

For more details see: https://tidywf.github.io/tidywigits/articles/installation

CLI

A tidywigits.R command line interface is available for convenience.

If you’re using the conda package, the tidywigits.R command will already be available inside the activated conda environment.
If you’re not using the conda package, you need to add the tidywigits/inst/cli/ directory to your PATH in order to use tidywigits.R:

tw_cli=$(Rscript -e 'x = system.file("cli", package = "tidywigits"); cat(x, "\n")' | xargs)
export PATH="${tw_cli}:${PATH}"

$ tidywigits.R --version
tidywigits 0.0.7.9006

#-----------------------------------#
$ tidywigits.R --help
usage: tidywigits.R [-h] [-v] {tidy,list} ...

✨ WiGiTS Output Tidying ✨

positional arguments:
  {tidy,list}    sub-command help
    tidy         Tidy Workflow Outputs
    list         List Parsable Workflow Outputs

options:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
#-----------------------------------#
#------- Tidy ----------------------#
$ tidywigits.R tidy --help
usage: tidywigits.R tidy [-h] -d IN_DIR [-o OUTPUT_DIR] [-f FORMAT]
                         [--input_id INPUT_ID] [--output_id OUTPUT_ID |
                         --ulid] [--dbname DBNAME] [--dbuser DBUSER]
                         [--include INCLUDE] [--exclude EXCLUDE]
                         [--prefix_include] [-q]

options:
  -h, --help            show this help message and exit
  -d, --in_dir IN_DIR   Input directory.
  -o, --output_dir OUTPUT_DIR
                        Output directory.
  -f, --format FORMAT   Format of output [def: parquet] (parquet, db, tsv,
                        csv, rds)
  --input_id INPUT_ID   Input ID for this run.
  --output_id OUTPUT_ID
                        Output ID for this run.
  --ulid                Generate a ULID as output ID.
  --dbname DBNAME       Database name.
  --dbuser DBUSER       Database user.
  --include INCLUDE     Include only these files (comma sep tool_parsers).
  --exclude EXCLUDE     Exclude only these files (comma sep tool_parsers).
  --prefix_include      Include input prefix column in output tables.
  -q, --quiet           Shush all the logs.

#-----------------------------------#
#------- List ----------------------#
$ tidywigits.R list --help
usage: tidywigits.R list [-h] -d IN_DIR [-f FORMAT] [-m MAX] [-q]

options:
  -h, --help           show this help message and exit
  -d, --in_dir IN_DIR  Input directory.
  -f, --format FORMAT  Format of list output [def: pretty] (tsv, pretty)
  -m, --max MAX        Max rows to show.
  -q, --quiet          Shush all the logs.
