Representation in genetic studies affects inference about genetic architecture

Jared M. Cole, Shane Rybacki, Samuel Pattillio Smith, Olivia S. Smith, and Arbel Harpak.

Welcome! Here you can find the code to reproduce the analyses of "Representation in genetic studies affects inference about genetic architecture".

Overview:

Data files generated from this project can be found at Zenodo (https://doi.org/10.5281/zenodo.18228037).
Code used to perform all analyses and plotting can be found in the scripts directory.
The following software was used to perform analyses:
- R
- Python
- RStudio
- plink 2.0
- ashr
- LDSC
- S-LD4M
The following R packages were used for plotting and data parsing:
- optparse
- dplyr
- tidyverse
- Biostrings
- ggplot2
- ggExtra
- ggrepel
- patchwork
- grid
- gridExtra
- cowplot
- scales
- forcats
- data.table

Data

Please visit the Zenodo repository (https://doi.org/10.5281/zenodo.18228037) for the generated output files.

Summary statistics generated by the Neale lab can be accessed publicly at: https://www.nealelab.is/uk-biobank.

All of Us data are available to registered researchers via the All of Us Researcher Workbench.

FinnGen GWAS summary statistics are available at https://r12.finngen.fi.

Script documentation

Scripts are numbered roughly in the order in which the corresponding analyses appear in the text. (UKB = UK Biobank, AoU = All of Us, FG = FinnGen)

1. Retrieve UKB and FG data

1_Retrieve_data_UKB_FG.sh: Retrieves metadata and phenotype fields from UKB and retrieves GWAS summary statistics from the Neale lab's v3 GWAS and FinnGen.

2. Build AoU phenotypes and perform GWAS

All of the following analyses were performed on the AoU Researcher Workbench.

2-2_AoU_retrieve_ICD10_codes.R: Script to retrieve ICD10 codes for binary traits.
2-3_AoU_retreieve_lab_measurements.py: Script to retrieve laboratory measurements (monocyte, basophil, and neutrophil percentages).
2-4_AoU_retrieve_physical_measures.py: Script to retrieve physical measurements (Height, Weight, BMI).
2-5_AoU_phenotype_curation.R: Phenotype QC and processing.
2-6_AoU_retrieve_ACAF.sh: Script to retrieve genetic data (ACAF CDv8).
2-7_AoU_matched_UKB_variants.R: Gets a set of variants present in UKB.
2-8_AoU_variant_QC.sh: Variant QC steps.
2-9_AoU_GWAS_plink.sh: Performs GWAS on all traits.

3. Build UKB phenotypes (also perform split GWAS)

3-1_UKB_phenotype_QC_GWAS_processing.R: Phenotype QC and processing.
3-2_UKB_GWAS_group1.sh: Performs GWAS on first group.
3-3_UKB_GWAS_group2.sh: Performs GWAS on second group.

4. LDSC analyses (h2 and rg) and S-LD4M inputs

4_Prep_sumstats.R: Prepare summary statistics for munge step in LDSC and write S-LD4M inputs per trait (chi-square statistics and rsids).
4-1_UKB_AoU_munge_sumstats.sh: Run the prep script and munge summary statistics for LDSC.
4-2_UKB_AoU_LDSC.sh: Run LDSC.
4-3_Prep_UKB_group_sumstats.R: Prepare summary statistics for munge step in LDSC for UKB groups.
4-4_UKB_group_LDSC.sh: Munge and run LDSC for split UKB groups.
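
As a rough illustration of the per-trait S-LD4M inputs mentioned above (chi-square statistics and rsids), the sketch below derives a 1-degree-of-freedom chi-square statistic as the squared z-score, (beta / se)^2. This is an assumption about how the statistic could be computed, not a description of 4_Prep_sumstats.R. The column names `rsid`, `beta`, and `se` and the function name are hypothetical.

```python
import csv
import io


def chi_square_rows(sumstats_tsv):
    """Return (rsid, chi-square) pairs from tab-separated summary statistics.

    The column names 'rsid', 'beta', and 'se' are hypothetical placeholders,
    and chi-square is assumed to be the squared z-score (beta / se) ** 2.
    """
    rows = []
    reader = csv.DictReader(io.StringIO(sumstats_tsv), delimiter="\t")
    for row in reader:
        se = float(row["se"])
        if se <= 0:
            continue  # skip variants with an unusable standard error
        z = float(row["beta"]) / se
        rows.append((row["rsid"], z * z))
    return rows


# Toy input: the third variant has se = 0 and is dropped.
example = "rsid\tbeta\tse\nrs1\t0.10\t0.05\nrs2\t-0.30\t0.10\nrs3\t0.02\t0.00\n"
result = chi_square_rows(example)
```

Here `result` holds one pair per retained variant; real summary statistics would be read from a file rather than an in-memory string.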

5. Simulate cohort sampling and GWAS

5-1_Simulate_skewed_cohorts_schemeA.R: Script for performing biased cohort sampling, GWAS, and sign-bias calculation under scheme A (described in Methods; Fig. 3, panels A-C). Parameter definitions and descriptions (e.g., N_POP, N_SAMP, GAMMA_FIXED, TAU_FIXED) and a usage guide are documented in the script header block. Includes plotting functions for Fig. 3 A-C.
5-2_Simulate_skewed_cohorts_schemeB.R: Script for performing biased cohort sampling, GWAS, and sign-bias calculation under scheme B (described in Methods; Fig. 3, panels D-F). Parameter definitions and descriptions and a usage guide are documented in the script header block. Includes plotting functions for Fig. 3 D-F.
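
For readers unfamiliar with biased cohort sampling, the toy sketch below shows the general idea: individuals join a cohort with a probability that depends on their trait value, so the sampled trait distribution differs from the population's. It does not reproduce scheme A or scheme B, and the logistic participation model, the `slope` parameter, and the function name are all assumptions made for illustration. The arguments `n_pop` and `n_samp` are hypothetical and are not the scripts' N_POP and N_SAMP.

```python
import math
import random


def simulate_biased_sample(n_pop=20000, n_samp=2000, slope=1.0, seed=1):
    """Draw a trait-dependent sample from a simulated population.

    Hypothetical model: the trait is standard normal and the participation
    weight is logistic in the trait value. slope = 0 gives unbiased sampling.
    """
    rng = random.Random(seed)
    population = [rng.gauss(0.0, 1.0) for _ in range(n_pop)]
    weights = [1.0 / (1.0 + math.exp(-slope * y)) for y in population]
    # Weighted sampling without replacement: give each individual the key
    # u ** (1 / w) with u uniform on [0, 1), then keep the n_samp largest keys.
    keyed = sorted(
        ((rng.random() ** (1.0 / w), y) for w, y in zip(weights, population)),
        reverse=True,
    )
    sample = [y for _, y in keyed[:n_samp]]
    return population, sample


population, sample = simulate_biased_sample()
pop_mean = sum(population) / len(population)
samp_mean = sum(sample) / len(sample)
```

With a positive `slope`, `samp_mean` exceeds `pop_mean` because individuals with higher trait values are over-represented in the sample.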

6. Clean GWAS sumstats

6-1_Clean_GWAS_AoU.R: Cleans AoU GWAS sumstats in preparation for downstream analyses.
6-2_Clean_GWAS_UKB.R: Cleans UKB GWAS sumstats in preparation for downstream analyses.
6-2_Clean_GWAS_FG.R: Cleans FG GWAS sumstats in preparation for downstream analyses.
6-4_Run_cleaning.sh: Runs the previous code on all summary statistics.

7. Run ASH, calculate sign bias, and binning

7-1_Run_ash_GWAS.R: Performs ash step and calculates per-site sign bias on cleaned input GWAS.
7-2_ash_all.sh: Runs previous code on all cleaned summary statistics.
7-3_Bin_and_Calc_sign_bias.R: Takes ash-derived SNP sign estimates from previous step and computes a sign-bias summary using either (i) one random SNP per LD block or (ii) the lowest-p (from GWAS) SNP per LD block.
7-4_Run_binning_calc.sh: Runs the previous code on all summary statistics.
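
The two SNP-selection modes described for 7-3_Bin_and_Calc_sign_bias.R (one random SNP per LD block, or the lowest-p SNP per LD block) can be sketched as follows. This is a minimal illustration, not the script's implementation: the record fields `block`, `p`, and `sign` are hypothetical, and the summary used here (the fraction of selected SNPs with a positive sign) is a placeholder assumption rather than the paper's definition of sign bias.

```python
import random


def pick_one_per_block(snps, mode="lowest_p", seed=0):
    """Select one SNP per LD block.

    mode="lowest_p" keeps the SNP with the smallest GWAS p-value in each block;
    mode="random" keeps one SNP chosen uniformly at random from each block.
    """
    rng = random.Random(seed)
    by_block = {}
    for snp in snps:
        by_block.setdefault(snp["block"], []).append(snp)
    chosen = []
    for block in sorted(by_block):
        members = by_block[block]
        if mode == "lowest_p":
            chosen.append(min(members, key=lambda s: s["p"]))
        elif mode == "random":
            chosen.append(rng.choice(members))
        else:
            raise ValueError("mode must be 'lowest_p' or 'random'")
    return chosen


def fraction_positive(chosen):
    """Placeholder summary (an assumption): share of selected SNPs with sign > 0."""
    return sum(1 for s in chosen if s["sign"] > 0) / len(chosen)


# Toy data with hypothetical fields.
snps = [
    {"rsid": "rsA", "block": 1, "p": 0.01, "sign": 1},
    {"rsid": "rsB", "block": 1, "p": 0.20, "sign": -1},
    {"rsid": "rsC", "block": 2, "p": 0.50, "sign": -1},
    {"rsid": "rsD", "block": 2, "p": 0.03, "sign": 1},
    {"rsid": "rsE", "block": 3, "p": 0.001, "sign": -1},
]
lowest = pick_one_per_block(snps, mode="lowest_p")
randomly = pick_one_per_block(snps, mode="random", seed=42)
summary = fraction_positive(lowest)
```

Selecting a single SNP per block is meant to limit the influence of correlated neighbouring variants on the summary; the random mode additionally avoids conditioning the selection on the GWAS p-value.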

8. Plotting, regression models, and summary analyses

8-1_Plotting_and_LM_models.R: Loads summary results from UKB, AoU, and FinnGen, generates the main (except Fig. 3) and supplemental figures, and fits regression models to quantify cross-cohort differences and relate sign bias to trait skewness.

9. Ancestry-matching between AoU and UKB samples

Details about the ancestry-matching procedure can be found at https://github.com/harpak-lab/ancestry-matching.

10. Estimating effective polygenicity

Software for S-LD4M can be found at https://github.com/lukejoconnor/SLD4M.
