Calculate SNP-based population statistics over groups of samples in VCF files with:
- indexable BED output
- correct handling of missing data
- support for polyploid variant calls
- higher data yield due to per-group ALT-agnostic SNP retrieval
- a broad selection of statistics, extensible with modules
- convenient helper tools for making genomic windows, filtering and summarizing the results
- the power of GNU AWK: no installation, competitive speed, low memory footprint, and multiprocessing 👀
Warning
piawka is under development. At this stage, breaking changes are still possible. If something does not work as expected, check for a newer version and do not hesitate to file an issue!
$ conda install -c bioconda piawka
Alternatively, make sure the following programs are available on the command line and clone the repo:
- gawk >= v5.2.0
- tabix
- bgzip
$ git clone https://github.com/novikovalab/piawka.git
$ export PATH="$( realpath ./piawka ):${PATH}"
Docs are available at https://novikovalab.github.io/piawka.
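For the manual installation route, a quick check that the three listed programs are on the PATH can save some debugging. The snippet below is a minimal sketch, not part of piawka itself; it only reports which tools are missing and, if gawk is present, prints its version so it can be compared against the v5.2.0 requirement by eye.

```shell
# Collect the names of any required tools that are not on the PATH.
missing=""
for tool in gawk tabix bgzip; do
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done

if [ -n "$missing" ]; then
  echo "Missing tools:$missing"
else
  # PROCINFO["version"] is a gawk built-in holding the gawk version string.
  gawk 'BEGIN { print "gawk version:", PROCINFO["version"] }'
fi
```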
Mandatory (for piawka calc):
- VCF file -- bgzipped and tabixed
Optional:
- groups file -- 2-column TSV with sample ID and group ID (may include relevant samples only)
- regions/targets file -- BED file to restrict/split output by regions
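As an illustration of the groups file layout, the sketch below writes a small 2-column TSV and checks its shape. The sample IDs (sample_01 and so on) and the file name my_groups.tsv are hypothetical placeholders; real IDs must match the sample names in your VCF header.

```shell
# Hypothetical sample IDs; replace them with sample names from your VCF header.
# printf reuses the format for each pair of arguments: sample ID, then group ID.
printf '%s\t%s\n' \
  sample_01 LE_2n \
  sample_02 LE_2n \
  sample_03 PUWS_4n \
  sample_04 PUWS_4n > my_groups.tsv

# Sanity check: every line must have exactly two tab-separated fields.
awk -F'\t' 'NF != 2 { bad++ } END { exit (bad > 0) }' my_groups.tsv \
  && echo "my_groups.tsv looks OK"
```

Samples that are left out of the file are simply not assigned to any group, in line with the note that the file may include relevant samples only.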
Output is a BED file:
$ cd piawka/examples
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz -b genes.bed -g groups.tsv -s pi,dxy
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 CESiberia_2n LE_2n dxy 0.0071137 460 64664
scaffold_1 10035093 10035276 AL5G20950 PUWS_4n . pi 0.00588993 640 108660
scaffold_1 10035093 10035276 AL5G20950 LE_2n PUWS_4n dxy 0.00881262 1102 125048
scaffold_1 10035093 10035276 AL5G20950 LE_2n . pi 0.00772461 1078 139554
...
Subcommands:
- piawka calc: calculate various population statistics from a VCF file
- piawka dist: convert calc output to a PHYLIP or NEXUS distance matrix
- piawka filt: filter piawka output using AWK expressions
- piawka list: show all statistics available for calculation
- piawka sum: summarize stats from calc output across regions
- piawka win: prepare genomic windows from various sources
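Since the calc output is plain columnar text, it can also be inspected with ordinary AWK. The sketch below is not piawka's own code: it copies the four example rows shown above into a hypothetical file, calc_example.bed, and checks that the value column equals numerator divided by denominator, which holds for these rows up to rounding.

```shell
# Rows copied from the example output above (whitespace-separated here;
# the default awk field splitting handles tabs and spaces alike).
cat > calc_example.bed <<'EOF'
#chr start end locus pop1 pop2 stat value numerator denominator
scaffold_1 10035093 10035276 AL5G20950 CESiberia_2n LE_2n dxy 0.0071137 460 64664
scaffold_1 10035093 10035276 AL5G20950 PUWS_4n . pi 0.00588993 640 108660
scaffold_1 10035093 10035276 AL5G20950 LE_2n PUWS_4n dxy 0.00881262 1102 125048
scaffold_1 10035093 10035276 AL5G20950 LE_2n . pi 0.00772461 1078 139554
EOF

# Count rows where value (column 8) differs from numerator/denominator
# (columns 9 and 10) by more than a small rounding tolerance.
mismatches=$(awk '
  /^#/ { next }                       # skip the header line
  {
    d = $8 - $9 / $10
    if (d < 0) d = -d
    if (d > 1e-6) bad++
  }
  END { print bad + 0 }
' calc_example.bed)
echo "rows where value != numerator/denominator: $mismatches"
```

Keeping the numerator and denominator alongside the value makes it possible to pool several regions as a ratio of sums rather than a mean of ratios.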
Within groups:
- lines: number of lines used in calculation
- miss: share of missing genotype calls
- pi: expected heterozygosity = nucleotide diversity
- maf: minor allele frequency
- daf: alternative ("derived") allele frequency
- tajima: Tajima's D
- tajimalike: Tajima's D interpolated for missing genotypes (experimental)
- theta_w: Watterson's theta
- theta_low: theta estimator based on sites with 0<allele_freq<0.33
- theta_mid: theta estimator based on sites with 0.33<=allele_freq<0.66
- theta_high: theta estimator based on sites with 0.66<=allele_freq<1
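To make the pi definition concrete, here is a small worked sketch of a common way to compute nucleotide diversity when some genotypes are missing: count differing allele pairs and compared allele pairs per site, then divide the sums. The allele counts and the file name site_counts.txt are hypothetical, and whether piawka's internals follow exactly this scheme is an assumption.

```shell
# Hypothetical per-site allele counts for one group: REF count, ALT count.
# Missing calls simply lower the number of alleles counted at that site
# (the third site has 6 called alleles instead of 8).
cat > site_counts.txt <<'EOF'
6 2
8 0
3 3
EOF

pi_parts=$(awk '
  {
    n = $1 + $2                 # alleles actually called at this site
    diffs += $1 * $2            # pairs of alleles that differ
    comps += n * (n - 1) / 2    # all pairs of called alleles
  }
  END { printf "%d %d %.6g", diffs, comps, diffs / comps }
' site_counts.txt)
echo "numerator denominator pi: $pi_parts"
```

The invariant second site adds comparisons but no differences, so it lowers pi as expected instead of being ignored.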
Between groups (pairwise):
- afd: average allele frequency difference
- dxy: absolute nucleotide divergence
- fst: fixation index, Hudson's estimator
- fstwc: fixation index, Weir & Cockerham's estimator
- rho: Ronfort's rho
- nei: Nei's D standard genetic distance
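The within-group and between-group statistics are related. As a rough illustration, the textbook form of Hudson's estimator can be written as (dxy - mean pi) / dxy. The sketch below plugs in the pi and dxy values for LE_2n and PUWS_4n from the example output; it is not piawka's implementation, and for real analyses the fst statistic computed by piawka calc should be preferred.

```shell
# Values taken from the example calc output for locus AL5G20950.
pi_1=0.00772461   # pi, LE_2n
pi_2=0.00588993   # pi, PUWS_4n
dxy=0.00881262    # dxy, LE_2n vs PUWS_4n

# Textbook Hudson-style Fst: (between - mean within) / between.
fst=$(awk -v p1="$pi_1" -v p2="$pi_2" -v d="$dxy" \
  'BEGIN { printf "%.4f", (d - (p1 + p2) / 2) / d }')
echo "Hudson-style Fst (illustrative): $fst"
```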
piawka was first described in https://doi.org/10.1093/molbev/msaf153, which is also the source of the test data.