Jared M. Cole, Shane Rybacki, Samuel Pattillio Smith, Olivia S. Smith, and Arbel Harpak.
Welcome! Here you can find the code to reproduce the analyses of "Representation in genetic studies affects inference about genetic architecture".
- Data files generated from this project can be found at Zenodo (https://doi.org/10.5281/zenodo.18228037).
- Code used to perform all analyses and plotting can be found in the scripts directory.
- The following software was used to perform analyses:
- The following R packages were used for plotting and data parsing:
- optparse
- dplyr
- tidyverse
- Biostrings
- ggplot2
- ggExtra
- ggrepel
- patchwork
- grid
- gridExtra
- cowplot
- scales
- forcats
- data.table
Please visit the Zenodo repository (https://doi.org/10.5281/zenodo.18228037) for the generated output files.
Summary statistics generated by the Neale lab can be accessed publicly at: https://www.nealelab.is/uk-biobank.
All of Us data are available to registered researchers via the All of Us Researcher Workbench.
FinnGen GWAS summary statistics are available at https://r12.finngen.fi.
Scripts are numbered roughly in the order in which the corresponding analyses appear in the text. (UKB = UK Biobank, AoU = All of Us, FG = FinnGen)
1_Retrieve_data_UKB_FG.sh: Retrieves metadata and phenotype fields from UKB and retrieves GWAS summary statistics from the Neale lab's v3 GWAS and FinnGen.
All of the following analyses were performed on the AoU Researcher Workbench.
- 2-2_AoU_retrieve_ICD10_codes.R: Script to retrieve ICD10 codes for binary traits.
- 2-3_AoU_retreieve_lab_measurements.py: Script to retrieve laboratory measurements (monocyte, basophil, and neutrophil percentages).
- 2-4_AoU_retrieve_physical_measures.py: Script to retrieve physical measurements (height, weight, BMI).
- 2-5_AoU_phenotype_curation.R: Phenotype QC and processing.
- 2-6_AoU_retrieve_ACAF.sh: Script to retrieve genetic data (ACAF CDv8).
- 2-7_AoU_matched_UKB_variants.R: Gets a set of variants present in UKB.
- 2-8_AoU_variant_QC.sh: Variant QC steps.
- 2-9_AoU_GWAS_plink.sh: Performs GWAS on all traits.
- 3-1_UKB_phenotype_QC_GWAS_processing.R: Phenotype QC and processing.
- 3-2_UKB_GWAS_group1.sh: Performs GWAS on the first group.
- 3-3_UKB_GWAS_group2.sh: Performs GWAS on the second group.
- 4_Prep_sumstats.R: Prepares summary statistics for the LDSC munge step and writes per-trait S-LD4M inputs (chi-square statistics and rsIDs).
- 4-1_UKB_AoU_munge_sumstats.sh: Runs the prep script and munges summary statistics for LDSC.
- 4-2_UKB_AoU_LDSC.sh: Runs LDSC.
- 4-3_Prep_UKB_group_sumstats.R: Prepares summary statistics for the LDSC munge step for the UKB groups.
- 4-4_UKB_group_LDSC.sh: Munges summary statistics and runs LDSC for the split UKB groups.
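As a rough illustration of the kind of per-trait input described above (chi-square statistics paired with rsIDs), the sketch below derives a 1-df chi-square statistic as the squared Wald z-score, (beta / se)^2. This is not the code in 4_Prep_sumstats.R: the column names (`rsid`, `beta`, `se`), the tab-delimited layout, and the function name are assumptions made for the example, and the real script may compute or filter the statistics differently.

```python
import csv
import io


def chi_square_rows(rows):
    """Yield (rsid, chi-square) pairs, with chi-square = (beta / se) ** 2.

    `rows` is an iterable of dicts with the assumed keys 'rsid', 'beta', 'se'.
    """
    for row in rows:
        se = float(row["se"])
        if se <= 0:
            continue  # skip variants without a usable standard error
        z = float(row["beta"]) / se
        yield row["rsid"], z * z


# Toy summary statistics (hypothetical values and column names).
toy = io.StringIO(
    "rsid\tbeta\tse\n"
    "rs1\t0.5\t0.25\n"
    "rs2\t-0.75\t0.25\n"
    "rs3\t0.1\t0\n"
)
result = list(chi_square_rows(csv.DictReader(toy, delimiter="\t")))

for rsid, chisq in result:
    print(rsid, chisq)
```

In this toy input, rs3 is dropped because its standard error is zero; how the actual pipeline handles such variants is not stated here.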
- 5-1_Simulate_skewed_cohorts_schemeA.R: Script for performing biased cohort sampling, GWAS, and sign-bias analysis under scheme A (described in Methods; Fig. 3, panels A-C). Parameter definitions (e.g., N_POP, N_SAMP, GAMMA_FIXED, TAU_FIXED) and a usage guide are documented in the script header block. Includes plotting functions for Fig. 3 A-C.
- 5-2_Simulate_skewed_cohorts_schemeB.R: Script for performing biased cohort sampling, GWAS, and sign-bias analysis under scheme B (described in Methods; Fig. 3, panels D-F). Parameter definitions and a usage guide are documented in the script header block. Includes plotting functions for Fig. 3 D-F.
- 6-1_Clean_GWAS_AoU.R: Cleans AoU GWAS summary statistics in preparation for downstream analyses.
- 6-2_Clean_GWAS_UKB.R: Cleans UKB GWAS summary statistics in preparation for downstream analyses.
- 6-2_Clean_GWAS_FG.R: Cleans FG GWAS summary statistics in preparation for downstream analyses.
- 6-4_Run_cleaning.sh: Runs the cleaning scripts above on all summary statistics.
- 7-1_Run_ash_GWAS.R: Performs the ash step and calculates per-site sign bias on cleaned input GWAS.
- 7-2_ash_all.sh: Runs the previous script on all cleaned summary statistics.
- 7-3_Bin_and_Calc_sign_bias.R: Takes the ash-derived SNP sign estimates from the previous step and computes a sign-bias summary using either (i) one random SNP per LD block or (ii) the SNP with the lowest GWAS p-value per LD block.
- 7-4_Run_binning_calc.sh: Runs the previous script on all summary statistics.
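The two per-block selection rules mentioned above (one random SNP per LD block, or the SNP with the lowest GWAS p-value per LD block) can be sketched as follows. This is a minimal illustration and not the code in 7-3_Bin_and_Calc_sign_bias.R: the record keys (`id`, `block`, `p`, `sign`), the function names, and the toy data are hypothetical. The summary computed here, the fraction of selected SNPs with a positive sign, is only a stand-in, since the exact sign-bias statistic is defined in the scripts and the paper rather than in this README.

```python
import random
from collections import defaultdict


def pick_per_block(snps, mode="lowest_p", seed=0):
    """Return one SNP per LD block, in ascending block order.

    mode='lowest_p' keeps the SNP with the smallest GWAS p-value;
    mode='random' keeps one SNP drawn uniformly at random (seeded).
    """
    blocks = defaultdict(list)
    for snp in snps:
        blocks[snp["block"]].append(snp)

    rng = random.Random(seed)
    picked = []
    for block in sorted(blocks):
        members = blocks[block]
        if mode == "lowest_p":
            picked.append(min(members, key=lambda s: s["p"]))
        elif mode == "random":
            picked.append(rng.choice(members))
        else:
            raise ValueError("mode must be 'lowest_p' or 'random'")
    return picked


def fraction_positive(snps):
    """Stand-in summary (an assumption): share of SNPs with a positive sign."""
    return sum(1 for s in snps if s["sign"] > 0) / len(snps)


# Toy records with hypothetical field names.
toy = [
    {"id": "rs1", "block": 1, "p": 0.01, "sign": 1},
    {"id": "rs2", "block": 1, "p": 0.50, "sign": -1},
    {"id": "rs3", "block": 2, "p": 0.20, "sign": -1},
    {"id": "rs4", "block": 2, "p": 0.03, "sign": 1},
    {"id": "rs5", "block": 3, "p": 0.40, "sign": -1},
]

lowest = pick_per_block(toy, mode="lowest_p")
rand = pick_per_block(toy, mode="random", seed=1)

print([s["id"] for s in lowest], fraction_positive(lowest))
print([s["id"] for s in rand], fraction_positive(rand))
```

Keeping a single SNP per block is meant to limit the influence of correlated variants within a block on the summary; the random rule avoids conditioning the selection on the GWAS p-value.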
- 8-1_Plotting_and_LM_models.R: Loads summary results from UKB, AoU, and FinnGen, generates the main (except Fig. 3) and supplemental figures, and fits regression models to quantify cross-cohort differences and relate sign bias to trait skewness.
Details about the ancestry-matching procedure can be found at https://github.com/harpak-lab/ancestry-matching.
Software for S-LD4M can be found at https://github.com/lukejoconnor/SLD4M.