PG-DIA

PG-DIA is a Nextflow DSL2 workflow that builds customized protein databases from RNA-seq data and searches matched DIA-MS data against them. It combines RNA-seq variant calling, transcript assembly, ORF prediction for novel isoforms, protein database assembly, and novel peptide reporting in a single pipeline.

The repository follows an nf-core-style layout, but the root README.md is the best high-level guide for the current Zhang Lab workflow.

What the pipeline does

For each sample, the workflow:

Accepts RNA-seq input as FASTQ, BAM, or CRAM together with one DIA raw file path.
Runs the RNA variant branch to produce BAMs and annotated VCFs.
Converts annotated variants into variant peptide FASTA entries with pypgatk, then annotates amino acid changes.
Runs StringTie on the RNA-seq alignment output.
Runs gffcompare, filters novel transcript models, extracts transcript FASTA with gffread, and predicts ORFs with TransDecoder.
Combines the reference proteome, variant peptides, and novel isoform peptides into per-sample protein databases.
Runs DIA-NN against the per-sample combined FASTA.
Post-processes the DIA-NN parquet report into reference and novel peptide/protein matrices.

This is a sample-matched workflow: each RNA-seq sample is paired to one DIA raw input and yields its own customized database and DIA-NN output.
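The database assembly step (reference proteome plus variant peptides plus novel isoform peptides) can be pictured with a small sketch. This is only an illustration under stated assumptions, not the pipeline's own merging code, which this README does not show: the function names, the source-label header prefix, and the choice to drop exact duplicate sequences are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def combine_databases(sources: Dict[str, Iterable[str]]) -> List[Tuple[str, str]]:
    """Merge FASTA sources in the given order.

    Assumption for illustration: exact duplicate sequences are kept only
    once, so the source listed first (for example the reference) wins.
    The "label|" header prefix is hypothetical, not a PG-DIA convention.
    """
    seen = set()
    combined: List[Tuple[str, str]] = []
    for label, lines in sources.items():
        for header, seq in parse_fasta(lines):
            if seq in seen:
                continue
            seen.add(seq)
            combined.append((f"{label}|{header}", seq))
    return combined
```

Passing the reference first, for example `combine_databases({"ref": ref_lines, "var": variant_lines, "iso": isoform_lines})`, keeps reference headers for any sequence that also appears in the other two sources.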

Inputs

Samplesheet

The full PG-DIA workflow expects one row per sample with:

one RNA-seq input mode: fastq_1/fastq_2, or bam/bai, or cram/crai
one DIA input path in dia_raw

An example samplesheet for this branch looks like this:

sample,fastq_1,fastq_2,bam,bai,cram,crai,dia_raw,strandedness
SAMPLE_A,/data/rna/SAMPLE_A_R1.fastq.gz,/data/rna/SAMPLE_A_R2.fastq.gz,,,,,/data/dia/SAMPLE_A.d,unstranded
SAMPLE_B,,,/data/rna/SAMPLE_B.markdup.bam,/data/rna/SAMPLE_B.markdup.bam.bai,,,/data/dia/SAMPLE_B.RAW,unstranded
SAMPLE_C,,,,,/data/rna/SAMPLE_C.cram,/data/rna/SAMPLE_C.cram.crai,/data/dia/SAMPLE_C.d,unstranded

Column notes:

| Column | Required | Description |
| --- | --- | --- |
| `sample` | Yes | Sample identifier. Multiple FASTQ lanes for the same sample may reuse the same name. |
| `fastq_1`, `fastq_2` | Conditional | RNA-seq FASTQs. Use these if starting from raw reads. |
| `bam`, `bai` | Conditional | Coordinate-sorted BAM and index if alignment is already done. |
| `cram`, `crai` | Conditional | CRAM and CRAI if starting from CRAM instead of BAM. |
| `dia_raw` | Yes | DIA-MS raw path for the sample, for example a timsTOF `.d` directory or vendor raw file. |
| `strandedness` | Yes | Passed to StringTie. Supported values are `forward`, `reverse`, and `unstranded`. |
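Before launching a run, it can help to check a samplesheet against the rules above. The following is a minimal sketch, not the pipeline's own schema validation; the function names are hypothetical. It assumes that a mode counts as "in use" when any of its columns is filled, and it does not enforce that an index (`bai`, `crai`) or a second FASTQ accompanies the primary file. Sample names are not required to be unique, because multiple lanes may reuse the same name.

```python
import csv
import io
from typing import Dict, List

VALID_STRANDEDNESS = {"forward", "reverse", "unstranded"}

# Primary column first, companion column second.
MODE_COLUMNS = {
    "fastq": ("fastq_1", "fastq_2"),
    "bam": ("bam", "bai"),
    "cram": ("cram", "crai"),
}


def _filled(row: Dict[str, str], column: str) -> bool:
    return bool((row.get(column) or "").strip())


def rna_input_mode(row: Dict[str, str]) -> str:
    """Return 'fastq', 'bam', or 'cram'; raise ValueError if ambiguous."""
    used = [m for m, cols in MODE_COLUMNS.items()
            if any(_filled(row, c) for c in cols)]
    if len(used) != 1:
        raise ValueError(
            f"expected exactly one RNA-seq input mode, found {used or 'none'}")
    mode = used[0]
    primary = MODE_COLUMNS[mode][0]
    if not _filled(row, primary):
        raise ValueError(f"{mode} mode requires the {primary} column")
    return mode


def validate_samplesheet(text: str) -> List[str]:
    """Return a list of human-readable problems; empty means none found."""
    errors: List[str] = []
    reader = csv.DictReader(io.StringIO(text))
    # The header is line 1, so data rows start at line 2.
    for line_no, row in enumerate(reader, start=2):
        if not _filled(row, "sample"):
            errors.append(f"line {line_no}: missing sample")
        try:
            rna_input_mode(row)
        except ValueError as exc:
            errors.append(f"line {line_no}: {exc}")
        if not _filled(row, "dia_raw"):
            errors.append(f"line {line_no}: missing dia_raw")
        strand = (row.get("strandedness") or "").strip()
        if strand not in VALID_STRANDEDNESS:
            errors.append(f"line {line_no}: invalid strandedness {strand!r}")
    return errors
```

A typical use would be `validate_samplesheet(open("samplesheet.csv").read())`, printing each returned message before calling `nextflow run`.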

Required reference and workflow parameters

At minimum, plan to provide:

--input
--outdir
--protein_reference_db
either --genome with a configured reference bundle, or explicit --fasta and --gtf
--read_length when aligning from FASTQ

For the RNA variant branch:

provide --dbsnp and/or --known_indels if you want base recalibration
otherwise set --skip_baserecalibration

Quick start

Run from the repository root:

nextflow run bzhanglab/pgdia \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh38 \
  --read_length 151 \
  --protein_reference_db /path/to/reference_proteome.fa \
  --skip_baserecalibration

If you have known-sites VCFs for GATK base quality score recalibration (BQSR), pass them instead of skipping BQSR:

nextflow run bzhanglab/pgdia \
  -profile docker \
  --input samplesheet.csv \
  --outdir results \
  --genome GRCh38 \
  --read_length 151 \
  --protein_reference_db /path/to/reference_proteome.fa \
  --dbsnp /path/to/dbsnp.vcf.gz \
  --known_indels /path/to/known_indels.vcf.gz
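The choice between the two invocations above follows a simple rule: pass known sites when you have them, otherwise skip base recalibration. A small wrapper can encode that rule when runs are launched from scripts. This is a sketch only; `build_command` is a hypothetical helper, and it uses just the flags documented in this README.

```python
import shlex
from typing import List, Optional


def build_command(
    input_csv: str,
    outdir: str,
    protein_reference_db: str,
    genome: Optional[str] = None,
    fasta: Optional[str] = None,
    gtf: Optional[str] = None,
    read_length: Optional[int] = None,
    dbsnp: Optional[str] = None,
    known_indels: Optional[str] = None,
    profile: str = "docker",
) -> List[str]:
    """Assemble a `nextflow run` argument list from documented options."""
    if not genome and not (fasta and gtf):
        raise ValueError("provide --genome, or both --fasta and --gtf")

    cmd = ["nextflow", "run", "bzhanglab/pgdia", "-profile", profile,
           "--input", input_csv, "--outdir", outdir]
    if genome:
        cmd += ["--genome", genome]
    else:
        cmd += ["--fasta", fasta, "--gtf", gtf]
    if read_length is not None:
        cmd += ["--read_length", str(read_length)]
    cmd += ["--protein_reference_db", protein_reference_db]

    # Known sites enable base recalibration; without any, skip it.
    if dbsnp:
        cmd += ["--dbsnp", dbsnp]
    if known_indels:
        cmd += ["--known_indels", known_indels]
    if not dbsnp and not known_indels:
        cmd.append("--skip_baserecalibration")
    return cmd


def as_shell(cmd: List[str]) -> str:
    """Render the argument list as a single, safely quoted shell line."""
    return shlex.join(cmd)
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, or printed with `as_shell(cmd)` to review the command before running it.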

Useful runtime options:

-profile docker, -profile singularity, -profile conda, or -profile mamba
-resume to continue from a previous run
--diann_image to point to a DIA-NN container image name or tarball
--diann_bin if the DIA-NN executable path inside the image differs from the default
--diann_cpus to control DIA-NN threads

Reference configuration

The bundled conf/igenomes.config defines a specific --genome GRCh38 entry pointing to:

genome FASTA
annotation GTF
STAR index
VEP cache metadata

If you do not want to use that bundle, provide explicit reference files with parameters such as:

--fasta
--fasta_fai
--dict
--gtf
--star_index
--dbsnp
--known_indels

--read_length matters for STAR index generation and alignment. Set it to the actual read length of the RNA-seq data, for example 151 for 2x151 bp libraries.
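If the read length of a library is not known offhand, it can be estimated from the FASTQ itself. The sketch below is an illustration, not part of the pipeline: the function names are hypothetical, and it assumes standard four-line FASTQ records and reports the most common length among the first reads, which may differ from the sequencer setting if reads were trimmed.

```python
import gzip
from collections import Counter
from itertools import islice
from typing import Iterable


def modal_read_length(lines: Iterable[str], max_reads: int = 10000) -> int:
    """Return the most common sequence length among the first reads.

    Assumes four-line FASTQ records, so sequence lines sit at
    zero-based positions 1, 5, 9, and so on.
    """
    counts: Counter = Counter()
    for seq in islice(lines, 1, max_reads * 4, 4):
        counts[len(seq.strip())] += 1
    if not counts:
        raise ValueError("no reads found")
    return counts.most_common(1)[0][0]


def fastq_read_length(path: str, max_reads: int = 10000) -> int:
    """Open a plain or gzip-compressed FASTQ and estimate its read length."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return modal_read_length(handle, max_reads)
```

For example, `fastq_read_length("/data/rna/SAMPLE_A_R1.fastq.gz")` gives a value to pass as `--read_length`, after a sanity check against the sequencing run's documentation.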

Main outputs

Published results are written under --outdir and typically include:

| Path | Contents |
| --- | --- |
| `reports/` | MultiQC output and annotation summary reports |
| `pipeline_info/` | Nextflow execution metadata, params, and software versions |
| `annotation/<sample>/` | Annotated and decompressed VCF outputs from the RNA variant branch |
| `stringtie/` | Per-sample StringTie GTF assemblies |
| `isoform_db/<sample>/` | Predicted peptide FASTA from novel isoforms |
| `protein_db/<sample>/` | `<sample>_combined_protein_db.fa` and `<sample>_novel_protein_db.fa` |
| `diann_output/<sample>/` | DIA-NN parquet report, matrices parquet, and postprocessed novel/reference TSV matrices |
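After a run, the per-sample paths in the table above can be checked programmatically. This is a minimal sketch with hypothetical helper names; it covers only the per-sample directories and the two protein database file names stated above, since the exact file names inside the other directories are not listed in this README.

```python
from pathlib import Path
from typing import Dict, List


def expected_outputs(outdir: str, sample: str) -> Dict[str, Path]:
    """Map a short label to each documented per-sample output path."""
    root = Path(outdir)
    protein_db = root / "protein_db" / sample
    return {
        "annotation_dir": root / "annotation" / sample,
        "isoform_db_dir": root / "isoform_db" / sample,
        "combined_db": protein_db / f"{sample}_combined_protein_db.fa",
        "novel_db": protein_db / f"{sample}_novel_protein_db.fa",
        "diann_dir": root / "diann_output" / sample,
    }


def missing_outputs(outdir: str, sample: str) -> List[str]:
    """Return the labels of documented outputs that do not exist on disk."""
    return sorted(
        label
        for label, path in expected_outputs(outdir, sample).items()
        if not path.exists()
    )
```

Calling `missing_outputs("results", "SAMPLE_A")` returns an empty list when every documented per-sample path is present.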

Credits

PG-DIA in this repository was written and adapted by Wenrong Chen in the Zhang Lab, building on nf-core components.

Citation

See CITATIONS.md for software and workflow citation details.
