Skip to content

novikovalab/piawka

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

365 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

piawka

Conda Downloads

Calculate SNP-based population statistics over groups of samples in VCF files with:

  • indexable BED output
  • correct handling of missing data
  • support for polyploid variant calls
  • higher data yield due to per-group ALT-agnostic SNP retrieval
  • a broad selection of statistics, extensible with modules
  • convenient helper tools for making genomic windows, filtering and summarizing the results
  • the power of GNU AWK: no installation, competitive speed, low memory footprint, and multiprocessing 👀

Warning

piawka is under development. At this stage, breaking changes are not unthinkable of. If something does not seem to work well, check newer versions and do not hesitate to file an issue!

Installation

conda install -c bioconda piawka

Alternatively, have the following programs available in the command line and clone the repo:

  • gawk>=v5.2.0
  • tabix
  • bgzip
git clone https://github.com/novikovalab/piawka.git
export PATH="$( realpath ./piawka ):${PATH}"

Usage

Docs are available at https://novikovalab.github.io/piawka.

Input and output

Mandatory (for piawka calc):

  • VCF file -- bgzipped and tabixed

Optional:

  • groups file -- 2-column TSV with sample ID and group ID (may include relevant samples only)
  • regions/targets file -- BED file to restrict/split output by regions

Output is a BED file:

$ cd piawka/examples
$ piawka calc -v alyrata_scaff_1_10000k-10500k.vcf.gz -b genes.bed -g groups.tsv -s pi,dxy
#chr        start     end       locus      pop1              pop2              stat    value        numerator  denominator
scaffold_1  10035093  10035276  AL5G20950  CESiberia_2n      LE_2n             dxy     0.0071137    460        64664
scaffold_1  10035093  10035276  AL5G20950  PUWS_4n           .                 pi      0.00588993   640        108660
scaffold_1  10035093  10035276  AL5G20950  LE_2n             PUWS_4n           dxy     0.00881262   1102       125048
scaffold_1  10035093  10035276  AL5G20950  LE_2n             .                 pi      0.00772461   1078       139554
...

Subcommands

  • piawka calc: calculate various population statistics from a VCF file
  • piawka dist: convert calc output to PHYLIP or NEXUS distance matrix
  • piawka filt: filter piawka output using AWK expressions
  • piawka list: show all statistics available for calculation
  • piawka sum: summarize stats from calc output across regions
  • piawka win: prepare genomic windows from various sources

Statistics

Within groups:

  • lines: number of lines used in calculation
  • miss: share of missing genotype calls
  • pi: expected heterozygosity = nucleotide diversity
  • maf: minor allele frequency
  • daf: alternative ("derived") allele frequency
  • tajima: Tajima's D
  • tajimalike: Tajima's D interpolated for missing genotypes (experimental)
  • theta_w: Watterson's theta
  • theta_low: Theta estimator based on sites with 0<allele_freq<0.33
  • theta_mid: Theta estimator based on sites with 0.33<=allele_freq<0.66
  • theta_high: Theta estimator based on sites with 0.33<=allele_freq<0.66

Between groups (pairwise):

  • afd: average allele frequency difference
  • dxy: absolute nucleotide divergence
  • fst: fixation index, Hudson's estimator
  • fstwc: fixation index, Weir & Cockerham's estimator
  • rho: Ronfort's rho
  • nei: Nei's D standard genetic distance

Citation

First mention of piawka as well as the test data are coming from https://doi.org/10.1093/molbev/msaf153.

About

The `awk` script to calculate population stats in VCF files: convenient, simple, flexible

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors