CT-SLEB

Overview

This vignette provides a tutorial on the CT-SLEB method. CT-SLEB is a multi-ancestry polygenic risk score (PRS) method using summary statistics from genomewide association studies (GWAS) of multiple populations. The method contains three major steps: 1. Two-dimensional clumping and thresholding to select SNPs; 2. Empirical-Bayes method to estimate the effect sizes for the SNPs; 3. Super-learning step to combine PRS under different tuning parameters (p-value thresholds, clumping window size, r2-cutoff). We use PLINK 1.90 for clumping (flag: --clump), which currently is not supported in PLINK 2.0. We use PLINK 2.0 for calculating PRS (command: --score) because it allows calculating PRS with multiple effect sizes columns simultaneously (--score-col-nums). We use R package for implementing empirical-Bayes and super-learning steps. For simplicity of demonstration, we will use the system( ) command to call plink directly from R. The user can also directly use plink through the terminal.

Data and Software Preparation

Before starting the analyses, the user needs to download the following example dataset. This dataset contains example data for chromosome 22. The folder contains the information: 1. summary statistics for five populations (named as _sumdata); 2. outcome for tuning and validation dataset (named as y_tuning and y_validation); 3. genotype data with 20,000 subjects for tuning and validation (named as AFR_test_chr); 4. The reference data from 1000 Genomes Projects (1KG) for the clumping step (named as: AFR_ref_chr22 and EUR_ref_chr22). The directory to this example dataset is set to be “data_dir” in the code. We also need to download PLINK 1.9 and PLINK 2.0. To distinguish the two software, we name PLINK 1.9 as plink and PLINK2.0 as plink2 in the code.

#install the CTSLEB package
#install.packages("devtools")
library(devtools)

## Loading required package: usethis

#install_github("andrewhaoyu/CTSLEB")
library(CTSLEB)
library(data.table)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#Specify the directory to the data folder
data_dir = "data/"
#Specify the directory for the summary statistics
EUR_sumstats_file <- paste0(data_dir,"EUR_sumdata.txt") #  reference population
AFR_sumstats_file <- paste0(data_dir,"AFR_sumdata.txt") #  target population
#Specify the directory to the reference data for clumping purpose
EUR_ref_plinkfile <- paste0(data_dir,"EUR_ref_chr22")
AFR_ref_plinkfile <- paste0(data_dir,"AFR_ref_chr22")
#Specify the tuning and validation data directory
AFR_test_plinkfile <- paste0(data_dir,"AFR_test_chr22")
#Specify the directory to PLINK1.9 and PLINK2.0
plink19_exec <- "/data/williamsjacr/software/plink"
plink2_exec <- "/data/williamsjacr/software/plink2"
#Specify the directory to the result directory
out_dir = ""
outprefix <- "chr22"

Step 1: Two-dimensional Clumping and Thresholding (CT)

To demonstrate the method, we use the GWAS summary statistics from Chromosome 22 for European (EUR) and African (AFR) to construct the PRS for the AFR population. First, we load the GWAS summary statistics for two populations. The summary statistics contain these columns: 1. CHR; 2. SNP: the SNP ID used in 1000 Genomes Project Phase 3 Data (1KG); 3. BP: SNP position (GRCh37); 4. A1: effect allele; 5. BETA: effect size; 6. SE: standard error of the effect size 7. P: p-value for the association test; 8. rs_id: the rsid of the SNP. EUR population is usually the GWAS with the largest sample size. Although I am using the EUR population as an example here, the other population is not limited to the EUR population. You can use GWAS from other populations as long as you put a corresponding reference for the clumping step.

Step1 extends the single ancestry clumping and thresholding method (CT) to the multi-ancestry setting. Instead of only applying the CT method on either the EUR or the target population, CT uses two-different p-value cutoffs to select variants.

CT starts with the clumping step. We split the SNPs into two groups: 1. SNPs with p-values in the EUR population smaller than the target population. 2. Target population-specific SNPs or SNPs with p-values in the target population smaller than the EUR population. SNPs are clumped using the linkage disequilibrium (LD) estimates from the EUR population for the first group. Next, SNPs are clumped using the LD estimates from the target population. After the clumping, SNPs from the two groups are combined for the thresholding step for the second group.

library(data.table)
#load data from the EUR and the target population
sum_EUR <- fread(paste0(data_dir,"EUR_sumdata.txt"),header=T)
sum_AFR <- fread(paste0(data_dir,"AFR_sumdata.txt"),header=T)
head(sum_EUR)

##      CHR                      SNP       BP     A1      BETA          SE      P
##    <int>                   <char>    <int> <char>     <num>       <num>  <num>
## 1:    22  rs55926024:16054740:A:G 16054740      G  0.003215 0.004537115 0.4786
## 2:    22   rs4389403:16055070:G:A 16055070      A  0.006085 0.006921065 0.3793
## 3:    22 rs117246541:16055122:G:T 16055122      T  0.009954 0.008392917 0.2356
## 4:    22  rs72613661:16055942:C:T 16055942      C -0.002443 0.005135590 0.6343
## 5:    22   rs9617528:16061016:T:C 16061016      C -0.002830 0.005117541 0.5803
## 6:    22 rs140813654:16070003:C:T 16070003      T -0.003224 0.005580751 0.5635
##          rs_id
##         <char>
## 1:   rs2844885
## 2:   rs4389403
## 3: rs117246541
## 4:    rs738830
## 5:   rs9617528
## 6:  rs62224655

head(sum_AFR)

##      CHR                      SNP       BP     A1      BETA         SE      P
##    <int>                   <char>    <int> <char>     <num>      <num>  <num>
## 1:    22  rs55926024:16054740:A:G 16054740      G -0.013930 0.01157938 0.2290
## 2:    22   rs4389403:16055070:G:A 16055070      A -0.013550 0.02964989 0.6477
## 3:    22  rs72613661:16055942:C:T 16055942      C -0.001551 0.01309966 0.9057
## 4:    22  rs60899456:16056936:C:T 16056936      T  0.017700 0.02075516 0.3938
## 5:    22   rs9617528:16061016:T:C 16061016      C  0.005589 0.02181499 0.7978
## 6:    22 rs140813654:16070003:C:T 16070003      T  0.009752 0.02593617 0.7069
##         rs_id
##        <char>
## 1:  rs2844885
## 2:  rs4389403
## 3:   rs738830
## 4: rs60899456
## 5:  rs9617528
## 6: rs62224655

#Prepare the parameters for PLINK clumping purpose

PRS_farm <- SetParamsFarm(plink19_exec = plink19_exec,
                          plink2_exec = plink2_exec)

## $plink19_exec
## [1] "/data/williamsjacr/software/plink"
## 
## $plink2_exec
## [1] "/data/williamsjacr/software/plink2"
## 
## $r2_vec
## [1] 0.01 0.05 0.10 0.20 0.50 0.80
## 
## $wc_base_vec
## [1]  50 100
## 
## $pthres
## [1] 5e-08 5e-07 5e-06 5e-05 5e-04 5e-03 5e-02 5e-01 1e+00
## 
## $threads
## [1] 2
## 
## $mem
## [1] 8000

prs_mat <- dimCT(results_dir = out_dir,
                 sum_target = sum_AFR,
                 sum_ref = sum_EUR,
                 ref_plink = EUR_ref_plinkfile,
                 target_plink = AFR_ref_plinkfile,
                 test_target_plink = AFR_test_plinkfile,
                 out_prefix = outprefix,
                 params_farm = PRS_farm)

## [1] "executing AlignSum()"
## [1] "AlignSum() complete ..."
## [1] "executing SplitSum() ..."
## [1] "executing WriteSplitTables()... "
## [1] "creating sum_ref object"
## [1] "creating sum_target object"
## [1] "WriteSplitTables() complete ... "
## [1] "executing RunClump()... "
## [1] "RunClump() params_farm list will be used"
## [1] "out_prefix: chr22"
## [1] "RunClump() complete ..."
## [1] "executing PreparePlinkFile()... "
## [1] "params_farm list will be used"
## [1] "unique_infor dataframe complete"
## [1] "scores dataframe complete"
## [1] "p_values dataframe complete"
## [1] "executing CreateQRange()... "
## [1] "CreateQRange() complete... "
## [1] "PreparePlinkFile() complete ..."
## [1] "executing PRSscore()... "
## [1] "combining PRS results ..."
## [1] "prs_mat complete... "
## [1] "prs_mat object created"

In this example, we only use one chromosome 22. If you are using workflow (i.e. Snakemake) which dispatches jobs by chromosome or chunks, you will need to combine the snp_list for all the jobs together before the Ebayes step. The order of clumping parameters is the same for each job. prs_mat contains 20000 samples. We will next use 10000 of these samples and calculate a set of tuning parameters. Phenotypes for tuning the target PRS is located in data_dir, “y_tuning.txt”. The first two columns of prs_tune are the family id and individual ids. The target PRSs starts from the third column.

prs_tune <- prs_mat[1:10000,]

n.total.prs <- length(pthres)^2*length(r2_vec)*length(wc_base_vec)
prs_r2_vec_test <- rep(0,n.total.prs)

y_tune <- fread(paste0(data_dir,"y_tuning.txt"))

for(p_ind in 1:n.total.prs){
  model <- lm(y_tune$V1~prs_tune[,(2+p_ind)])
  prs_r2_vec_test[p_ind] <- summary(model)$r.square
}

max_ind <- which.max(prs_r2_vec_test)

print(colnames(prs_tune)[max_ind+2])

## [1] "clump_r2_0.01_ws_10000_p_other_5e-08_p_tar_0.05"

Step 2: Empirical-Bayes (EB) Estimation of Effect Sizes

Now we move to the EB step for estimating the regression coefficients of PRSs by using the genetic correlations of effect sizes across populations. At the end of two-dimensional clumping and thresholding step, we get SNP set with corresponding tuning parameters for estimating the covariance matrix for the prior distribution. In this particular example using data from chromosome 22, it’s using clumping r2-cutoff at 0.01, window size at 10000kb, SNPs with p_EUR < 5E-08 or p_target < 0.05. In real data analyses, PRSs need to be calculated based on all chromsome together to determine the SNP set.

#find the best snp set
best_snps <- colnames(prs_tune)[max_ind+2]
#calculate eb effect using EB coefficients
prs_mat_eb <- CalculateEBEffectSize(bfile = AFR_test_plinkfile ,
                                    snp_ind = best_snps,
                                    plink_list = plink_list,
                                    out_prefix = outprefix,
                                    results_dir = out_dir,
                                    params_farm = PRS_farm)

## [1] "Executing CalculateEBEffectSize() ... "
## [1] "Executing GetSNPSet()... "
## [1] "executing EBayesPostMean()... "
## [1] "executing EstimatePrior()... "
## [1] "EstimatePrior() complete... "
## [1] "10000 SNPs completed"
## [1] "20000 SNPs completed"
## [1] "EBayesPostMean() complete... "
## [1] "Executing PreparePlinkFileEBayes() ... "
## [1] "PreparePlinkFileEBayes() complete ..."
## [1] "executing PRSscoreEbayes()... "
## [1] "combining PRS results ..."
## [1] "prs_mat complete... "
## [1] "prs_mat_eb object created"

Step 3: Super Learning

In Step 3, we input all the PRSs and tuning parameters to train the super-learning model and predict the outcome. The super-learning model is a linear combination of different predictors based on multiple supervised learning algorithms. The set of prediction algorithm can be self-designed or chosen from classical prediction algorithms. In this tutorial, we will use the Lasso, ridge regression and neural networks as our predictors. Other predictors can also be selected. The R package SuperLearner is used in this step. Detailed guidance of SuperLearner package can be found at: https://cran.r-project.org/web/packages/SuperLearner/vignettes/Guide-to-SuperLearner.html. The prs_mat_eb data.frame created at the end of Step2 contains 20,000 samples. We will split these samples into a tuning and final validation sets. The first 10,000 samples will be the tuning set and the second 10,000 samples the validation set. The PRSTrainValSplit() will perform this spit and also drop all the PRS columns with a pairwise correlation greater than 0.98.

library(caret)
prs_tune <- prs_mat_eb[1:10000,]
prs_validation <- prs_mat_eb[10001:20000,]

y_tune <- fread(paste0(data_dir,"y_tuning.txt"))

Cleaned_Data <- PRS_Clean(Tune_PRS = prs_tune,Tune_Y = y_tune,Validation_PRS = prs_validation)

y_vad_file <- paste0(data_dir,"y_validation.txt")
y_vad <- fread(y_vad_file)

prs_tune_sl <- Cleaned_Data$Cleaned_Tune_PRS
prs_valid_sl <- Cleaned_Data$Cleaned_Validation_PRS

Next train the super-learning model.

library(SuperLearner)

## Loading required package: nnls

## Loading required package: gam

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.22-3

## Super Learner

## Version: 2.0-29

## Package created on 2024-02-06

library(ranger)
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-8

SL.library <- c(
  "SL.glmnet",
  "SL.ridge"
)

sl <- SuperLearner(Y = y_tune$V1, 
                   X = prs_tune_sl[,-c(1,2)], 
                   family = gaussian(),
                   SL.library = SL.library)

Predict the outcome using the independent validation dataset and evaluate the CT-SLEB PRS performance.

y_pred_valid <- predict(sl, prs_valid_sl[,-c(1,2)], onlySL = TRUE)

model <- lm(y_vad$V1~y_pred_valid$pred)
r2_ctsleb <- summary(model)$r.square
print(r2_ctsleb)

## [1] 0.0001450043

Extract the final coefficients for each SNP using the ExtractFinalBetas function.

y_pred_tune <- predict(sl, prs_tune_sl[,-c(1,2)], onlySL = TRUE)
Predicted_Tune_Y <- y_pred_tune$pred
Tune_PRS <- prs_tune_sl[,-c(1,2)]

Final_Betas <- ExtractFinalBetas(Tune_PRS = prs_tune_sl[,-c(1,2)],Predicted_Tune_Y = y_pred_tune$pred,prs_mat_eb = prs_mat_eb,unique_infor_post = unique_infor_post,pthres = pthres)

Write the final betas to file and test the accuracy of them in plink.

write.table(Final_Betas,file = paste0(out_dir,"Final_PRS_Coefficients"),col.names = T,row.names = F,quote=F)
system(paste0(plink2_exec," --threads 2 --score ",out_dir,"Final_PRS_Coefficients cols=+scoresums,-scoreavgs header no-mean-imputation  --bfile ",AFR_test_plinkfile," --out ",out_dir,"Final_Betas_Evaluation"))

Final_Betas_Evaluation <- read.delim(paste0(out_dir,"Final_Betas_Evaluation.sscore"), header=FALSE, comment.char="#")
cor(Final_Betas_Evaluation[,5],c(y_pred_tune$pred,y_pred_valid$pred))

## [1] 0.9999948