Tidy WiGiTS Outputs

tidywigits

tidywigits is an R package for parsing and tidying output from the WiGiTS/hmftools suite of genome and transcriptome analysis tools.

The WiGiTS pipeline produces hundreds of files per sample across dozens of tools, but consuming them downstream is fragile: formats are inconsistent, column names span a mix of conventions (e.g. snake_case, camelCase, dot.separated), some sub-tables are embedded in wide files, and column layouts change between tool versions.

tidywigits addresses this with a schema-driven parsing layer built on the nemo base R6 classes, supplying WiGiTS-specific schemas and parsers that turn raw outputs into consistently structured, versioned, analysis-ready tables that can be written to a variety of formats such as Apache Parquet, PostgreSQL, TSV, or RDS. Each run also produces a metadata.parquet file alongside the tidy tables, capturing IDs, paths, and R package versions.
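To make "schema-driven" more concrete, here is a minimal base R sketch of the general idea: a small table maps raw column names to tidy names and target types, and one function applies it. The `schema` layout and the `apply_schema()` helper are hypothetical illustrations and are not the schema format used by nemo or tidywigits; the column names are borrowed from the PURPLE QC example further down.

```r
# Hypothetical schema: raw column name, tidy column name, target type.
# This is NOT the nemo/tidywigits schema format, only an illustration.
schema <- data.frame(
  raw = c("QCStatus", "CopyNumberSegments", "Purity", "AmberGender"),
  tidy = c("qc_status", "cn_segments", "purity", "gender_amber"),
  type = c("character", "integer", "numeric", "character"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: select, type-cast and rename columns per the schema.
apply_schema <- function(d, schema) {
  missing_cols <- setdiff(schema$raw, names(d))
  if (length(missing_cols) > 0) {
    stop("columns missing from input: ", paste(missing_cols, collapse = ", "))
  }
  casters <- list(
    character = as.character,
    integer = as.integer,
    numeric = as.numeric
  )
  out <- d[, schema$raw, drop = FALSE]
  for (i in seq_len(nrow(schema))) {
    out[[i]] <- casters[[schema$type[i]]](out[[i]])
  }
  names(out) <- schema$tidy
  out
}

# Raw input with everything read as character
raw <- data.frame(
  QCStatus = "PASS",
  CopyNumberSegments = "472",
  Purity = "1.0000",
  AmberGender = "FEMALE",
  stringsAsFactors = FALSE
)
tidy <- apply_schema(raw, schema)
str(tidy)
```

Keeping names and types in one declarative table means a layout change between tool versions can be handled by adding a schema version rather than editing parsing code.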

Documentation

Quickstart

Single tool

Some WiGiTS output files have non-standard layouts. For a very simple example, let’s look at PURPLE’s purple.qc file that stores QC metrics as key-value rows rather than columns:

indir_ppl <- system.file("extdata/oa/purple", package = "tidywigits")
writeLines(readLines(file.path(indir_ppl, "sample1.purple.qc")))
#> QCStatus PASS
#> Method   NORMAL
#> CopyNumberSegments   472
#> UnsupportedCopyNumberSegments    5
#> Purity   1.0000
#> AmberGender  FEMALE
#> CobaltGender FEMALE
#> DeletedGenes 0
#> Contamination    0.0
#> GermlineAberrations  NONE
#> AmberMeanDepth   79
#> LohPercent   0.0270
#> TincLevel    0.0000
#> ChimerismPercentage  0.0000
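For illustration, a key-value layout like this can be reshaped into a single-row table with a few lines of base R. The sketch below assumes the file is tab-separated with exactly two fields per line; `kv_to_wide()` is a hypothetical helper and not how tidywigits implements its parser.

```r
# Hypothetical helper, not part of tidywigits: reshape a two-column
# key-value file into a single-row data frame.
kv_to_wide <- function(path) {
  kv <- utils::read.delim(
    path,
    header = FALSE,
    col.names = c("key", "value"),
    colClasses = "character"
  )
  # One column per key; values are still character at this point
  wide <- as.data.frame(
    as.list(stats::setNames(kv$value, kv$key)),
    stringsAsFactors = FALSE
  )
  # Let R guess column types (integer, double, character)
  utils::type.convert(wide, as.is = TRUE)
}

# Small self-contained demo using three rows from the file above
tmp <- tempfile(fileext = ".qc")
writeLines(
  c("QCStatus\tPASS", "CopyNumberSegments\t472", "Purity\t1.0000"),
  tmp
)
qc <- kv_to_wide(tmp)
str(qc)
```

Type guessing is the weak point of this approach, since the inferred type depends on the values present in a given file; declaring the types up front avoids that.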

We can utilise the Purple class to parse, tidy and write PURPLE files in one call via its run() method:

outdir_ppl <- file.path(tempdir(), "ppl_out")
# fmt: skip
ppl <- Purple$new(indir_ppl)
ppl$run(
  output_dir = outdir_ppl,
  format = "parquet",
  input_id = "run1"
)
list.files(outdir_ppl, pattern = "\\.parquet$")
#>  [1] "metadata_purple.parquet"                       
#>  [2] "sample1_2_purple_cnvgenetsv.parquet"           
#>  [3] "sample1_2_purple_qc.parquet"                   
#>  [4] "sample1_2_somatic_purple_drivercatalog.parquet"
#>  [5] "sample1_germline_purple_drivercatalog.parquet" 
#>  [6] "sample1_purple_cnvgenetsv.parquet"             
#>  [7] "sample1_purple_cnvsomtsv.parquet"              
#>  [8] "sample1_purple_germdeltsv.parquet"             
#>  [9] "sample1_purple_purityrange.parquet"            
#> [10] "sample1_purple_puritytsv.parquet"              
#> [11] "sample1_purple_qc.parquet"                     
#> [12] "sample1_purple_somclonality.parquet"           
#> [13] "sample1_purple_somhist.parquet"                
#> [14] "version_2_purple_version.parquet"              
#> [15] "version_purple_version.parquet"

Now read back the tidied table:

qc_file <- list.files(outdir_ppl, pattern = "sample1_purple_qc.parquet", full.names = TRUE)
arrow::read_parquet(qc_file) |> str()
#> tibble [1 × 15] (S3: tbl_df/tbl/data.frame)
#>  $ input_id               : chr "run1"
#>  $ qc_status              : chr "PASS"
#>  $ method                 : chr "NORMAL"
#>  $ cn_segments            : int 472
#>  $ cn_segments_unsupported: int 5
#>  $ purity                 : num 1
#>  $ gender_amber           : chr "FEMALE"
#>  $ gender_cobalt          : chr "FEMALE"
#>  $ deleted_genes          : int 0
#>  $ contamination          : num 0
#>  $ germline_aberrations   : chr "NONE"
#>  $ mean_depth_amber       : num 79
#>  $ loh_percent            : num 0.027
#>  $ tinc_level             : num 0
#>  $ chimerism_percent      : num 0
#>  - attr(*, "file_version")= chr "latest"

Full WiGiTS

Files from the full WiGiTS suite can be processed with the convenient Wigits class. The starting point is a parent directory with WiGiTS results, and we can again utilise the run() method:

View input files
indir_w <- system.file("extdata/oa", package = "tidywigits")
fs::dir_tree(indir_w, invert = TRUE, glob = "*.dvc")
/home/runner/miniconda3/envs/bump_env/lib/R/library/tidywigits/extdata/oa
├── alignments
│   ├── sample1.duplicate_freq.tsv
│   ├── sample1.md.metrics
│   └── sample1.redux.duplicate_freq.tsv
├── amber
│   ├── sample1.amber.baf.pcf
│   ├── sample1.amber.contamination.tsv
│   ├── sample1.amber.homozygousregion.tsv
│   └── sample1.amber.qc
├── bamtools
│   ├── sample1.bam_metric.coverage.tsv
│   ├── sample1.bam_metric.exon_medians.tsv
│   ├── sample1.bam_metric.flag_counts.tsv
│   ├── sample1.bam_metric.frag_length.tsv
│   ├── sample1.bam_metric.gene_coverage.tsv
│   ├── sample1.bam_metric.partition_stats.tsv
│   ├── sample1.bam_metric.summary.tsv
│   └── sample1.wgsmetrics
├── chord
│   ├── sample1.chord.mutation_contexts.tsv
│   └── sample1.chord.prediction.tsv
├── cider
│   ├── sample1.cider.blastn_match.tsv.gz
│   ├── sample1.cider.locus_stats.tsv
│   └── sample1.cider.vdj.tsv.gz
├── cobalt
│   ├── cobalt.version
│   ├── sample1.cobalt.gc.median.tsv
│   ├── sample1.cobalt.ratio.median.tsv
│   └── sample1.cobalt.ratio.pcf
├── cuppa
│   ├── sample1.cup.data.csv
│   ├── sample1.cuppa.pred_summ.tsv
│   ├── sample1.cuppa.vis_data.tsv
│   ├── sample1.cuppa_data.tsv.gz
│   └── v1.4
│       └── sample1.cup.data.csv
├── esvee
│   ├── sample1.esvee.alignment.tsv
│   ├── sample1.esvee.assembly.tsv
│   ├── sample1.esvee.breakend.tsv
│   ├── sample1.esvee.phased_assembly.tsv
│   ├── sample1.esvee.prep.disc_stats.tsv
│   ├── sample1.esvee.prep.fragment_length.tsv
│   └── sample1.esvee.prep.junction.tsv
├── flagstats
│   └── sample1.flagstat
├── isofox
│   ├── sample1.isf.alt_splice_junc.csv
│   ├── sample1.isf.fusions.csv
│   ├── sample1.isf.gene_collection.csv
│   ├── sample1.isf.gene_data.csv
│   ├── sample1.isf.pass_fusions.csv
│   ├── sample1.isf.retained_intron.csv
│   ├── sample1.isf.summary.csv
│   └── sample1.isf.transcript_data.csv
├── lilac
│   ├── sample1.lilac.candidates.coverage.tsv
│   ├── sample1.lilac.qc.tsv
│   └── sample1.lilac.tsv
├── linx
│   ├── germline_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.germline.breakend.tsv
│   │   ├── sample1.linx.germline.clusters.tsv
│   │   ├── sample1.linx.germline.driver.catalog.tsv
│   │   ├── sample1.linx.germline.links.tsv
│   │   └── sample1.linx.germline.svs.tsv
│   ├── somatic_annotations
│   │   ├── linx.version
│   │   ├── sample1.linx.breakend.tsv
│   │   ├── sample1.linx.clusters.tsv
│   │   ├── sample1.linx.driver.catalog.tsv
│   │   ├── sample1.linx.drivers.tsv
│   │   ├── sample1.linx.fusion.tsv
│   │   ├── sample1.linx.links.tsv
│   │   ├── sample1.linx.svs.tsv
│   │   ├── sample1.linx.vis_copy_number.tsv
│   │   ├── sample1.linx.vis_fusion.tsv
│   │   ├── sample1.linx.vis_gene_exon.tsv
│   │   ├── sample1.linx.vis_protein_domain.tsv
│   │   ├── sample1.linx.vis_segments.tsv
│   │   └── sample1.linx.vis_sv_data.tsv
│   └── v1.25
│       ├── germline_annotations
│       │   ├── linx.version
│       │   └── sample1.linx.germline.breakend.tsv
│       └── somatic_annotations
│           ├── linx.version
│           ├── sample1.linx.breakend.tsv
│           ├── sample1.linx.vis_copy_number.tsv
│           ├── sample1.linx.vis_fusion.tsv
│           ├── sample1.linx.vis_gene_exon.tsv
│           ├── sample1.linx.vis_protein_domain.tsv
│           ├── sample1.linx.vis_segments.tsv
│           └── sample1.linx.vis_sv_data.tsv
├── neo
│   ├── sample1.neo.neo_data.tsv
│   └── sample1.neo.neoepitope.tsv
├── peach
│   ├── sample1.peach.events.tsv
│   ├── sample1.peach.gene.events.tsv
│   ├── sample1.peach.haplotypes.all.tsv
│   ├── sample1.peach.haplotypes.best.tsv
│   └── sample1.peach.qc.tsv
├── purple
│   ├── purple.version
│   ├── sample1.purple.cnv.gene.tsv
│   ├── sample1.purple.cnv.somatic.tsv
│   ├── sample1.purple.driver.catalog.germline.tsv
│   ├── sample1.purple.driver.catalog.somatic.tsv
│   ├── sample1.purple.germline.deletion.tsv
│   ├── sample1.purple.purity.range.tsv
│   ├── sample1.purple.purity.tsv
│   ├── sample1.purple.qc
│   ├── sample1.purple.somatic.clonality.tsv
│   ├── sample1.purple.somatic.hist.tsv
│   └── v4.0
│       ├── purple.version
│       ├── sample1.purple.cnv.gene.tsv
│       └── sample1.purple.qc
├── sage
│   ├── germline
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample2.sage.bqr.tsv
│   │   ├── sample2.sage.exon.medians.tsv
│   │   └── sample2.sage.gene.coverage.tsv
│   ├── somatic
│   │   ├── sample1.sage.bqr.tsv
│   │   ├── sample1.sage.exon.medians.tsv
│   │   ├── sample1.sage.gene.coverage.tsv
│   │   └── sample2.sage.bqr.tsv
│   └── v3.4.4
│       └── sample1.sage.bqr.tsv
├── sigs
│   ├── sample1.sig.allocation.tsv
│   └── sample1.sig.snv_counts.csv
├── teal
│   ├── sample1.teal.breakend.tsv.gz
│   └── sample1.teal.tellength.tsv
├── virusbreakend
│   └── sample1.virusbreakend.vcf.summary.tsv
└── virusinterpreter
    └── sample1.virus.annotated.tsv

We can parse, tidy up, and write the WiGiTS results into e.g. Parquet format or a PostgreSQL database as follows:

  • Parquet:
outdir_w <- file.path(tempdir(), "wigits_out_parquet")
w <- Wigits$new(indir_w)
res <- w$run(
  output_dir = outdir_w,
  format = "parquet",
  input_id = "run1",
  output_id = "out1",
  prefix_include = TRUE
)
res # shows summary of Wigits object
#> #--- Workflow Wigits ---#
#> 
#> |var           |value                                                                     |
#> |:-------------|:-------------------------------------------------------------------------|
#> |name          |Wigits                                                                    |
#> |path          |/home/runner/miniconda3/envs/bump_env/lib/R/library/tidywigits/extdata/oa |
#> |ntools        |19                                                                        |
#> |files_total   |228                                                                       |
#> |files_matched |112                                                                       |
#> |tidied        |true                                                                      |
#> |written       |true                                                                      |
list.files(outdir_w, pattern = "\\.parquet$") |> str()
#>  chr [1:120] "metadata.parquet" "sample1_2_2_linx_breakends.parquet" ...
  • PostgreSQL (adjust dbname/user for your purposes):
w2 <- Wigits$new(indir_w)
dbconn <- DBI::dbConnect(
  drv = RPostgres::Postgres(),
  dbname = "tidywigits",
  user = "me"
)
res <- w2$run(
  format = "db",
  input_id = "run2",
  output_id = "out2",
  prefix_include = TRUE,
  dbconn = dbconn
)
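Once Parquet files have been written, a common next step is to load them all back for analysis. The sketch below is one possible way to do that with `arrow::read_parquet()`; `read_parquet_dir()` is a hypothetical helper (not part of tidywigits), and the demo uses two toy tables rather than real WiGiTS output so that it runs on its own.

```r
# Hypothetical helper, not part of tidywigits: read every Parquet file in a
# directory into a list of data frames named after the files.
read_parquet_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.parquet$", full.names = TRUE)
  tabs <- lapply(files, arrow::read_parquet)
  names(tabs) <- tools::file_path_sans_ext(basename(files))
  tabs
}

# Self-contained demo on toy tables (needs the arrow package)
if (requireNamespace("arrow", quietly = TRUE)) {
  demo_dir <- file.path(tempdir(), "parquet_demo")
  dir.create(demo_dir, showWarnings = FALSE)
  arrow::write_parquet(
    data.frame(input_id = "run1", x = 1L),
    file.path(demo_dir, "table_a.parquet")
  )
  arrow::write_parquet(
    data.frame(input_id = "run1", y = "a"),
    file.path(demo_dir, "table_b.parquet")
  )
  tabs <- read_parquet_dir(demo_dir)
  names(tabs)
}
```

With real output, the same helper could be pointed at the output directory of a Parquet run, e.g. `read_parquet_dir(outdir_w)`.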

Note: Support for VCFs is a work in progress.

Three optional columns can be prepended to every written table to support downstream tracing and joining. All are opt-in and off by default, but highly recommended for any multi-sample or multi-run pipeline:

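| Column | Purpose | User-supplied or auto-generated? |
|:-------------|:-----------------------------------------|:--------------------|
| input_id | identifies the sample or input run | user |
| output_id | identifies the tidywigits processing run | user or auto (ULID) |
| input_prefix | filename prefix (e.g. sample name) | auto |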
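The value of these ID columns shows up when tables from several runs are combined. The base R sketch below uses toy stand-ins for two tidied QC tables and a hypothetical sample sheet; all values and the `cohort` column are made up for illustration only.

```r
# Toy stand-ins for tidied QC tables from two runs (illustrative values)
qc_run1 <- data.frame(input_id = "run1", qc_status = "PASS", purity = 1.00)
qc_run2 <- data.frame(input_id = "run2", qc_status = "PASS", purity = 0.62)

# Hypothetical sample sheet keyed on the same input_id
samples <- data.frame(
  input_id = c("run1", "run2"),
  cohort = c("A", "B")
)

# Stack the per-run tables; input_id keeps track of where each row came from
qc_all <- rbind(qc_run1, qc_run2)

# Join external annotations on input_id
qc_annot <- merge(qc_all, samples, by = "input_id")
qc_annot
```

Without `input_id`, the stacked rows would be indistinguishable once they leave their original files.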

Installation

Using {remotes} directly from GitHub:

install.packages("remotes")
remotes::install_github("tidywf/tidywigits") # latest main commit
remotes::install_github("tidywf/tidywigits@v0.0.7.9006") # specific version

For alternative installation methods and more details, see: https://tidywf.github.io/tidywigits/articles/installation

CLI

A tidywigits.R command line interface is available for convenience.

  • If you’re using the conda package, the tidywigits.R command will already be available inside the activated conda environment.
  • If you’re not using the conda package, you need to export the tidywigits/inst/cli/ directory to your PATH in order to use tidywigits.R.
tw_cli=$(Rscript -e 'x = system.file("cli", package = "tidywigits"); cat(x, "\n")' | xargs)
export PATH="${tw_cli}:${PATH}"
$ tidywigits.R --version
tidywigits 0.0.7.9006

#-----------------------------------#
$ tidywigits.R --help
usage: tidywigits.R [-h] [-v] {tidy,list} ...

✨ WiGiTS Output Tidying ✨

positional arguments:
  {tidy,list}    sub-command help
    tidy         Tidy Workflow Outputs
    list         List Parsable Workflow Outputs

options:
  -h, --help     show this help message and exit
  -v, --version  show program's version number and exit
#-----------------------------------#
#------- Tidy ----------------------#
$ tidywigits.R tidy --help
usage: tidywigits.R tidy [-h] -d IN_DIR [-o OUTPUT_DIR] [-f FORMAT]
                         [--input_id INPUT_ID] [--output_id OUTPUT_ID |
                         --ulid] [--dbname DBNAME] [--dbuser DBUSER]
                         [--include INCLUDE] [--exclude EXCLUDE]
                         [--prefix_include] [-q]

options:
  -h, --help            show this help message and exit
  -d, --in_dir IN_DIR   Input directory.
  -o, --output_dir OUTPUT_DIR
                        Output directory.
  -f, --format FORMAT   Format of output [def: parquet] (parquet, db, tsv,
                        csv, rds)
  --input_id INPUT_ID   Input ID for this run.
  --output_id OUTPUT_ID
                        Output ID for this run.
  --ulid                Generate a ULID as output ID.
  --dbname DBNAME       Database name.
  --dbuser DBUSER       Database user.
  --include INCLUDE     Include only these files (comma sep tool_parsers).
  --exclude EXCLUDE     Exclude only these files (comma sep tool_parsers).
  --prefix_include      Include input prefix column in output tables.
  -q, --quiet           Shush all the logs.

#-----------------------------------#
#------- List ----------------------#
$ tidywigits.R list --help
usage: tidywigits.R list [-h] -d IN_DIR [-f FORMAT] [-m MAX] [-q]

options:
  -h, --help           show this help message and exit
  -d, --in_dir IN_DIR  Input directory.
  -f, --format FORMAT  Format of list output [def: pretty] (tsv, pretty)
  -m, --max MAX        Max rows to show.
  -q, --quiet          Shush all the logs.
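If the CLI needs to be driven from an R script (for example, when looping over many result directories), the flags listed in the `tidy` help above can be assembled programmatically. This is a sketch only: `build_tidy_args()` is a hypothetical helper, the paths are placeholders, and the command is only executed when `tidywigits.R` is on the `PATH` and the input directory exists.

```r
# Hypothetical helper: assemble arguments for `tidywigits.R tidy`
# using only flags shown in the help output above.
build_tidy_args <- function(in_dir, output_dir, format = "parquet",
                            input_id = NULL, ulid = FALSE,
                            prefix_include = FALSE) {
  args <- c("tidy", "-d", shQuote(in_dir), "-o", shQuote(output_dir), "-f", format)
  if (!is.null(input_id)) {
    args <- c(args, "--input_id", shQuote(input_id))
  }
  if (ulid) {
    args <- c(args, "--ulid")
  }
  if (prefix_include) {
    args <- c(args, "--prefix_include")
  }
  args
}

# Placeholder paths; replace with real directories
in_dir <- "results/wigits"
args <- build_tidy_args(in_dir, "tidy_out", input_id = "run1", ulid = TRUE)

# Show the command that would be run
cat("tidywigits.R", args, "\n")

# Only run when the CLI is available and the input directory exists
if (nzchar(Sys.which("tidywigits.R")) && dir.exists(in_dir)) {
  status <- system2("tidywigits.R", args)
}
```

`--ulid` and `--output_id` are mutually exclusive in the help text, which is why the helper exposes only one of them.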
