jmcole003/sign_bias_gwas

Representation in genetic studies affects inference about genetic architecture

Jared M. Cole, Shane Rybacki, Samuel Pattillio Smith, Olivia S. Smith, and Arbel Harpak.

Welcome! Here you can find the code to reproduce the analyses of "Representation in genetic studies affects inference about genetic architecture".

Overview:

  1. Data files generated from this project can be found at Zenodo: https://doi.org/10.5281/zenodo.18228037
  2. Code used to perform all analyses and plotting can be found in the scripts directory.
  3. The following software was used to perform analyses:
  4. The following R packages were used for plotting and data parsing:
    • optparse
    • dplyr
    • tidyverse
    • Biostrings
    • ggplot2
    • ggExtra
    • ggrepel
    • patchwork
    • grid
    • gridExtra
    • cowplot
    • scales
    • forcats
    • data.table

Data

Please visit the Zenodo repository (https://doi.org/10.5281/zenodo.18228037) for the generated output files.

Summary statistics generated by the Neale lab can be accessed publicly at: https://www.nealelab.is/uk-biobank.

All of Us data are available to registered researchers via the All of Us Researcher Workbench.

FinnGen GWAS summary statistics are available at https://r12.finngen.fi.

Script documentation

Scripts are numbered roughly in the order in which the corresponding analyses appear in the text. (UKB = UK Biobank, AoU = All of Us, FG = FinnGen)

1. Retrieve UKB and FG data

  • 1_Retrieve_data_UKB_FG.sh: Retrieves metadata and phenotype fields from UKB, and GWAS summary statistics from the Neale lab's v3 GWAS and FinnGen.

2. Build AoU phenotypes and perform GWAS

All of the following analyses were performed on the AoU Researcher Workbench.

  • 2-2_AoU_retrieve_ICD10_codes.R: Script to retrieve ICD10 codes for binary traits.
  • 2-3_AoU_retreieve_lab_measurements.py: Script to retrieve laboratory measurements (monocyte, basophil, and neutrophil percentages).
  • 2-4_AoU_retrieve_physical_measures.py: Script to retrieve physical measurements (Height, Weight, BMI).
  • 2-5_AoU_phenotype_curation.R: Phenotype QC and processing.
  • 2-6_AoU_retrieve_ACAF.sh: Script to retrieve genetic data (ACAF CDv8).
  • 2-7_AoU_matched_UKB_variants.R: Gets a set of variants present in UKB.
  • 2-8_AoU_variant_QC.sh: Variant QC steps.
  • 2-9_AoU_GWAS_plink.sh: Performs GWAS on all traits.

3. Build UKB phenotypes (also perform split GWAS)

  • 3-1_UKB_phenotype_QC_GWAS_processing.R: Phenotype QC and processing.
  • 3-2_UKB_GWAS_group1.sh: Performs GWAS on first group.
  • 3-3_UKB_GWAS_group2.sh: Performs GWAS on second group.

4. LDSC analyses (h2 and rg) and S-LD4M inputs

  • 4_Prep_sumstats.R: Prepare summary statistics for munge step in LDSC and write S-LD4M inputs per trait (chi-square statistics and rsids).
  • 4-1_UKB_AoU_munge_sumstats.sh: Run the prep script and munge summary statistics for LDSC.
  • 4-2_UKB_AoU_LDSC.sh: Run LDSC.
  • 4-3_Prep_UKB_group_sumstats.R: Prepare summary statistics for munge step in LDSC for UKB groups.
  • 4-4_UKB_group_LDSC.sh: Munge and run LDSC for split UKB groups.
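
As a rough illustration of the per-trait chi-square inputs mentioned for 4_Prep_sumstats.R, the sketch below computes chi-square = (beta / SE)^2 for each variant. This is not the repository's code: the column names (`rsid`, `beta`, `se`), the tab-delimited layout, and the function name `chi_square_rows` are assumptions made for the example, and the actual script may filter or format its output differently.

```python
import csv
import io
import math


def chi_square_rows(rows):
    """Yield (rsid, chi2) pairs, where chi2 = (beta / se) ** 2.

    `rows` is an iterable of dicts with hypothetical keys
    'rsid', 'beta', and 'se'. Rows with missing, non-numeric,
    non-finite, or non-positive values are skipped.
    """
    for row in rows:
        try:
            rsid = row["rsid"]
            beta = float(row["beta"])
            se = float(row["se"])
        except (KeyError, TypeError, ValueError):
            continue
        if not (math.isfinite(beta) and math.isfinite(se)) or se <= 0:
            continue
        z = beta / se
        yield rsid, z * z


# Toy input; the values are illustrative only.
example = io.StringIO(
    "rsid\tbeta\tse\n"
    "rs1\t0.5\t0.25\n"
    "rs2\t-0.75\t0.25\n"
    "rs3\tNA\t0.25\n"
)
result = list(chi_square_rows(csv.DictReader(example, delimiter="\t")))
```

Here `rs3` is dropped because its effect size is not numeric, so `result` holds one pair each for `rs1` and `rs2`.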

5. Simulate cohort sampling and GWAS

  • 5-1_Simulate_skewed_cohorts_schemeA.R: Performs biased cohort sampling, GWAS, and sign-bias calculation under scheme A (described in Methods; Fig. 3, panels A-C). Parameter definitions and descriptions (e.g., N_POP, N_SAMP, GAMMA_FIXED, TAU_FIXED) and a usage guide are documented in the script header block. Includes plotting functions for Fig. 3, panels A-C.
  • 5-2_Simulate_skewed_cohorts_schemeB.R: Performs biased cohort sampling, GWAS, and sign-bias calculation under scheme B (described in Methods; Fig. 3, panels D-F). Parameter definitions and descriptions and a usage guide are documented in the script header block. Includes plotting functions for Fig. 3, panels D-F.
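
For readers unfamiliar with the idea of biased cohort sampling, here is a generic toy sketch in which individuals with larger phenotype values are more likely to be sampled. It does not implement scheme A or scheme B, which are defined in the Methods and in the script headers. The function name `sample_skewed_cohort`, the `tilt` parameter, and the exponential weighting are all assumptions chosen for illustration, and sampling is done with replacement for simplicity.

```python
import math
import random
import statistics


def sample_skewed_cohort(phenotypes, n_samp, tilt, seed=0):
    """Draw n_samp phenotype values with weights exp(tilt * phenotype).

    Hypothetical weighting: tilt > 0 over-represents high phenotype
    values, and tilt = 0 gives uniform sampling.
    Sampling is with replacement (random.choices).
    """
    rng = random.Random(seed)
    weights = [math.exp(tilt * y) for y in phenotypes]
    return rng.choices(phenotypes, weights=weights, k=n_samp)


# Toy population with a standard normal phenotype.
pop_rng = random.Random(1)
population = [pop_rng.gauss(0.0, 1.0) for _ in range(5000)]

cohort = sample_skewed_cohort(population, n_samp=1000, tilt=1.0)
pop_mean = statistics.mean(population)
cohort_mean = statistics.mean(cohort)
```

With a positive `tilt`, the cohort mean sits above the population mean, which is the kind of non-representativeness the simulation scripts explore in a more careful way.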

6. Clean GWAS sumstats

  • 6-1_Clean_GWAS_AoU.R: Cleans AoU GWAS summary statistics in preparation for downstream analyses.
  • 6-2_Clean_GWAS_UKB.R: Cleans UKB GWAS summary statistics in preparation for downstream analyses.
  • 6-2_Clean_GWAS_FG.R: Cleans FG GWAS summary statistics in preparation for downstream analyses.
  • 6-4_Run_cleaning.sh: Runs the cleaning scripts above on all summary statistics.

7. Run ASH, calculate sign bias, and binning

  • 7-1_Run_ash_GWAS.R: Performs the ash step and calculates per-site sign bias on cleaned input GWAS summary statistics.
  • 7-2_ash_all.sh: Runs the previous script on all cleaned summary statistics.
  • 7-3_Bin_and_Calc_sign_bias.R: Takes the ash-derived SNP sign estimates from the previous step and computes a sign-bias summary using either (i) one random SNP per LD block or (ii) the SNP with the lowest GWAS p-value per LD block.
  • 7-4_Run_binning_calc.sh: Runs the previous script on all summary statistics.
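
The two per-block selection modes described for 7-3_Bin_and_Calc_sign_bias.R can be sketched as follows. This is a minimal illustration, not the repository's implementation: the record layout (`block`, `p`, `sign`), the function names, and the use of the fraction of positive signs as the summary are assumptions; the sign-bias statistic actually used in the paper may be defined differently.

```python
import random
from collections import defaultdict


def pick_one_per_block(snps, mode="random", seed=0):
    """Select one SNP per LD block.

    `snps` is an iterable of dicts with hypothetical keys:
    'block' (LD block label), 'p' (GWAS p-value), 'sign' (+1 or -1).
    mode='random' picks a random SNP in each block;
    mode='lowest_p' picks the SNP with the smallest p-value.
    """
    blocks = defaultdict(list)
    for snp in snps:
        blocks[snp["block"]].append(snp)

    rng = random.Random(seed)
    picked = []
    for block in sorted(blocks):
        members = blocks[block]
        if mode == "random":
            picked.append(rng.choice(members))
        elif mode == "lowest_p":
            picked.append(min(members, key=lambda s: s["p"]))
        else:
            raise ValueError("mode must be 'random' or 'lowest_p'")
    return picked


def fraction_positive(picked):
    """Fraction of selected SNPs with a positive sign (assumed summary)."""
    if not picked:
        raise ValueError("no SNPs selected")
    return sum(1 for s in picked if s["sign"] > 0) / len(picked)


# Toy data; the values are illustrative only.
snps = [
    {"rsid": "rs1", "block": 1, "p": 0.5, "sign": 1},
    {"rsid": "rs2", "block": 1, "p": 0.001, "sign": -1},
    {"rsid": "rs3", "block": 2, "p": 0.2, "sign": 1},
    {"rsid": "rs4", "block": 3, "p": 0.04, "sign": 1},
    {"rsid": "rs5", "block": 3, "p": 0.3, "sign": -1},
]
lowest = pick_one_per_block(snps, mode="lowest_p")
randomly = pick_one_per_block(snps, mode="random", seed=42)
```

Keeping one SNP per block limits the influence of linkage disequilibrium on the summary, since SNPs within a block are not independent.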

8. Plotting, regression models, and summary analyses

  • 8-1_Plotting_and_LM_models.R: Loads summary results from UKB, AoU, and FinnGen, generates the main figures (except Fig. 3) and supplemental figures, and fits regression models that quantify cross-cohort differences and relate sign bias to trait skewness.

9. Ancestry-matching between AoU and UKB samples

Details about the ancestry-matching procedure can be found at https://github.com/harpak-lab/ancestry-matching.

10. Estimating effective polygenicity

Software for S-LD4M can be found at https://github.com/lukejoconnor/SLD4M.
