Introduction

The growing availability of larger, more ethnically diverse reference panels has increased the need to re-impute genotype data efficiently and to update existing GWAS. The GAUSS package supports this with two functions, dist() and distmix(), which impute association Z-scores for unmeasured SNPs. This vignette walks through the practical use of both: dist() for ethnically homogeneous cohorts and distmix() for multi-ethnic cohorts.

Load necessary packages
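
The code in this vignette calls fread(), %>%, filter(), kable(), and ggplot(), so it presumably relies on data.table, dplyr, knitr, and ggplot2 in addition to GAUSS. The original setup chunk is not shown here; the following is a minimal sketch under that assumption, and the package list is inferred rather than confirmed.

```r
# Assumed package set, inferred from the functions used later in this vignette.
pkgs <- c("GAUSS", "data.table", "dplyr", "knitr", "ggplot2")

# Report anything that is not installed before attaching.
not_installed <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(not_installed) > 0) {
  message("Not installed: ", paste(not_installed, collapse = ", "))
}

# Attach the packages that are available.
for (p in setdiff(pkgs, not_installed)) {
  library(p, character.only = TRUE)
}
```

Checking with requireNamespace() first gives a single readable message instead of stopping at the first missing package.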

Preparing the Input Data

Both dist() and distmix() functions require two core input datasets:

  • Association Z-score Data
  • Reference Panel Data

Association Z-score Data

This data should be in the form of a space-delimited text file including six columns with the following names:

  • rsid: SNP ID
  • chr: Chromosome number
  • bp: Base pair position
  • a1: Reference allele
  • a2: Alternative allele
  • z: Association Z-score

For this example, we will use the Psychiatric Genomics Consortium’s Phase 2 Schizophrenia (PGC SCZ2) GWAS dataset.

# Path to the input file
input_file <- "../data/PGC2_3Mb.txt"

# Input file should include six columns (rsid, chr, bp, a1, a2, and z)
input.data <- fread(input_file, header = TRUE)
head(input.data)
#>          rsid chr        bp a1 a2         z
#> 1:  rs1004467  10 104594507  A  G  6.686674
#> 2:  rs1008013  10 103548866  A  T -1.769923
#> 3: rs10128116  10 103717613  A  G -1.883298
#> 4:  rs1015037  10 105547517  T  G -1.917614
#> 5: rs10159775  10 103184297  A  G  1.304979
#> 6: rs10159838  10 105473937  A  G  2.582526
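
Before running an imputation, it may help to confirm that the input table has the six expected columns. The helper below is hypothetical (check_zscore_input() is not part of GAUSS), and the requirement that z be numeric is an assumption based on the column description.

```r
# Hypothetical helper (not part of GAUSS): check the six expected columns.
check_zscore_input <- function(df) {
  required <- c("rsid", "chr", "bp", "a1", "a2", "z")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: Z-scores must be numeric.
  if (!is.numeric(df$z)) {
    stop("Column 'z' must be numeric")
  }
  invisible(TRUE)
}

# Two rows copied from the head() output above.
example_input <- data.frame(
  rsid = c("rs1004467", "rs1008013"),
  chr  = c(10, 10),
  bp   = c(104594507, 103548866),
  a1   = c("A", "A"),
  a2   = c("G", "T"),
  z    = c(6.686674, -1.769923),
  stringsAsFactors = FALSE
)
check_zscore_input(example_input)
```

In practice the same call could be applied to input.data right after fread().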

Reference Panel Data

We will use the 33KG reference panel. Replace the file paths below with those of your actual reference panel files.

reference_index_file <- "../ref/Human/33KG/33kg_index.gz"
reference_data_file <- "../ref/Human/33KG/33kg_geno.gz"
reference_pop_desc_file <- "../ref/Human/33KG/33kg_pop_desc.txt"

The dist() function

The dist() function is designed for ethnically homogeneous cohorts. It imputes association Z-scores for unmeasured SNPs directly from the summary statistics of measured SNPs and a reference panel, which makes it straightforward to update an existing GWAS against a newer panel.

Arguments

The dist() function accepts the following arguments:

  • chr: Chromosome number.
  • start_bp: Starting base pair position of the estimation window.
  • end_bp: Ending base pair position of the estimation window.
  • wing_size: Size of the area flanking the left and right of the estimation window.
  • study_pop: Study population group.
  • input_file: File name of the association Z-score data.
  • reference_index_file: File name of the reference panel index data.
  • reference_data_file: File name of the reference panel data.
  • reference_pop_desc_file: File name of the reference panel population description data.
  • af1_cutoff: Cutoff of reference allele (a1) frequency.
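
One way to read start_bp, end_bp, and wing_size together is that the wings extend the estimation window by wing_size base pairs on each side. The sketch below classifies the six SNP positions from the head() output above under that reading. It is an interpretation of the argument descriptions, not a statement about GAUSS internals, and the variable names are illustrative.

```r
# Illustrative values; bp holds the six positions from the head() output above.
start_bp  <- 104000001
end_bp    <- 105000000
wing_size <- 500000
bp <- c(104594507, 103548866, 103717613, 105547517, 103184297, 105473937)

# Assumed reading: wings extend the window by wing_size on each side.
ext_start <- start_bp - wing_size
ext_end   <- end_bp + wing_size

# Label each position as inside the window, inside a wing, or outside both.
region <- ifelse(bp >= start_bp & bp <= end_bp, "window",
          ifelse(bp >= ext_start & bp <= ext_end, "wing", "outside"))
data.frame(bp, region)
```

Under this reading, one of the six SNPs falls inside the estimation window, three fall in the wings, and two lie outside the extended region.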

Outputs

The dist() function returns a data frame with the following columns:

  • rsid: SNP ID.
  • chr: Chromosome number.
  • bp: Base pair position.
  • a1: Reference allele.
  • a2: Alternative allele.
  • af1ref: Reference allele frequency.
  • z: Association Z-score.
  • pval: Association P-value.
  • info: Imputation information, ranging from 0 to 1.
  • type: Type of variant. A value of 0 indicates an imputed variant, while 1 denotes a measured variant.
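
The z and pval columns appear to be related by the usual two-sided test against a standard normal distribution. The sketch below shows that conversion; z_to_p() is a hypothetical helper, and the assumption that GAUSS computes pval this way is based only on the example output in this vignette.

```r
# Two-sided p-value from a Z-score under a standard normal distribution.
# Hypothetical helper; assumed (not confirmed) to match the `pval` column.
z_to_p <- function(z) 2 * pnorm(-abs(z))

# Z-scores from the first two rows of the dist() output shown in this vignette.
p <- z_to_p(c(3.7785313, -1.2757191))
p
```

The two values should agree closely with the pval entries reported for those rows (0.0001578 and 0.2020548).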

Example Usage

In this example, we run dist() to impute association Z-scores for missing SNPs in a 1 Mb genomic region (chromosome 10: 104-105 Mb) of the PGC SCZ2 study. We assume that the study cohort consists of participants of European descent, represented as “EUR.” We therefore set study_pop = "EUR" so that genotype data for European subjects in the 33KG reference panel are used.

af1_cutoff <- 0.001

res <- dist(chr = 10, 
            start_bp = 104000001, 
            end_bp = 105000000, 
            wing_size = 500000, 
            study_pop = "EUR",
            input_file = input_file, 
            reference_index_file = reference_index_file,
            reference_data_file = reference_data_file, 
            reference_pop_desc_file = reference_pop_desc_file,
            af1_cutoff = af1_cutoff)
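
To cover a region longer than one window, the same call could be repeated over consecutive windows. The sketch below only generates the window boundaries. make_windows() is a hypothetical helper, and the 103-106 Mb range is an illustrative choice rather than a recommendation.

```r
# Hypothetical helper: split [from, to] into consecutive windows of `width` bp.
make_windows <- function(from, to, width) {
  starts <- seq(from, to, by = width)
  data.frame(start_bp = starts, end_bp = pmin(starts + width - 1, to))
}

# Three 1 Mb windows over an illustrative 3 Mb range.
windows <- make_windows(103000001, 106000000, 1000000)
windows
```

Each row of windows supplies one start_bp and end_bp pair; the remaining arguments to dist() would stay the same across calls, and the per-window results could be combined with rbind().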

Results

head(res) %>% kable("html")
rsid         chr  bp         a1  a2  af1ref   z           pval       info       type
rs117589665  10   104000008  G   A   0.05720   3.7785313  0.0001578  0.9498775  0
rs530689457  10   104000125  T   C   0.00336  -1.2757191  0.2020548  0.0831094  0
rs9664049    10   104000307  T   C   0.61243  -0.4576290  0.6472190  0.9859440  0
rs149691625  10   104000837  T   C   0.00351  -2.9077590  0.0036403  0.0870822  0
rs112009583  10   104001402  T   C   0.01793   0.6621509  0.5078745  0.9589020  0
rs35200058   10   104002372  A   G   0.00575   1.4120431  0.1579373  0.1878804  0
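
The info column can be used to keep only well-imputed variants. The sketch below applies an illustrative threshold of 0.6 to the six rows shown above, rebuilt by hand so the example runs without the reference panel. The threshold is an assumption for illustration, not a GAUSS default.

```r
# First six rows of the dist() output, copied from the table above.
res_head <- data.frame(
  rsid   = c("rs117589665", "rs530689457", "rs9664049",
             "rs149691625", "rs112009583", "rs35200058"),
  af1ref = c(0.05720, 0.00336, 0.61243, 0.00351, 0.01793, 0.00575),
  info   = c(0.9498775, 0.0831094, 0.9859440, 0.0870822, 0.9589020, 0.1878804),
  type   = c(0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep variants with info above the illustrative threshold.
well_imputed <- subset(res_head, info > 0.6)
well_imputed$rsid
```

In these six rows, the three variants that are dropped are also the three with af1ref below 0.01.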

The distmix() function

The distmix() function is designed for multi-ethnic cohorts. It extends dist() by taking the cohort's ancestry proportions into account when using the reference panel.

Arguments

The distmix() function takes a set of arguments that are largely identical to those for dist(). However, instead of the study_pop argument, distmix() incorporates pop_wgt_df:

  • pop_wgt_df: An R data frame that specifies the population IDs and their respective ancestry proportions.

Outputs

The output of distmix() is also largely identical to that of dist(), with one exception: the column af1mix replaces af1ref.

  • af1mix: An estimated reference allele frequency for the variant in the study cohort. It is calculated as a weighted sum of the reference allele frequencies across different populations in the reference panel.
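
The weighted sum described above can be sketched for a single variant as follows. The population labels, frequencies, and weights are all hypothetical and are not taken from the 33KG panel.

```r
# Hypothetical per-population a1 frequencies for one variant (not real data).
af1_by_pop <- c(popA = 0.06, popB = 0.01, popC = 0.03)

# Hypothetical ancestry proportions for the study cohort; they sum to 1.
wgt <- c(popA = 0.7, popB = 0.1, popC = 0.2)

# Weighted sum of population frequencies, matched by population name.
af1mix <- sum(wgt[names(af1_by_pop)] * af1_by_pop)
af1mix
```

Indexing wgt by names(af1_by_pop) keeps each weight paired with the right population even if the two vectors are ordered differently.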

Example Usage

Before using distmix(), you need to prepare ancestry proportion data. This data should be structured as a data frame containing two columns:

  • pop: Population abbreviation
  • wgt: Weight or proportion of each population in the study cohort.

You can estimate these proportions using the afmix() function. For a step-by-step guide on this process, refer to the afmix() vignette.
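
When the proportions are already known, a data frame of the required shape can also be built by hand. The sketch below uses two population abbreviations that appear in the example data in this vignette, with hypothetical weights. The check that weights sum to 1 is an assumption about what distmix() expects.

```r
# Hypothetical two-population cohort. Labels should match the reference
# panel's population abbreviations; the weights here are made up.
pop_wgt_df <- data.frame(
  pop = c("ACB", "ASW"),
  wgt = c(0.25, 0.75),
  stringsAsFactors = FALSE
)

# Basic checks: expected columns, non-negative weights, weights summing to 1
# (the last check is an assumption, not a documented requirement).
stopifnot(
  identical(names(pop_wgt_df), c("pop", "wgt")),
  all(pop_wgt_df$wgt >= 0),
  isTRUE(all.equal(sum(pop_wgt_df$wgt), 1))
)
```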

Here, we load pre-generated ancestry proportion data:

# Load the ancestry proportion data
data("PGC2_SCZ_ANC_Prop") # data frame name: PGC2_SCZ_ANC_Prop
head(PGC2_SCZ_ANC_Prop)
#>   pop   wgt
#> 1 ACB 0.006
#> 2 ASW 0.036
#> 3 BEB 0.005
#> 4 CCE 0.008
#> 5 CCS 0.004
#> 6 CDX 0.018

Now we use distmix() to impute association Z-scores for missing SNPs in the same 1 Mb genomic region on chromosome 10, from 104 to 105 Mb.

af1_cutoff <- 0.001

res.mix <- distmix(chr = 10,
                   start_bp = 104000001,
                   end_bp = 105000000,
                   wing_size = 500000,
                   pop_wgt_df = PGC2_SCZ_ANC_Prop,
                   input_file = input_file,
                   reference_index_file = reference_index_file,
                   reference_data_file = reference_data_file,
                   reference_pop_desc_file = reference_pop_desc_file,
                   af1_cutoff = af1_cutoff)

Results

head(res.mix) %>% kable("html")
rsid         chr  bp         a1  a2  af1mix     z           pval       info       type
rs117589665  10   104000008  G   A   0.0498071   3.7654378  0.0001663  0.9502819  0
rs530689457  10   104000125  T   C   0.0025437  -1.5946868  0.1107823  0.1066830  0
rs74469897   10   104000130  A   G   0.0019094  -0.3684543  0.7125345  0.0353522  0
rs115917085  10   104000143  G   T   0.0017765  -0.5972413  0.5503463  0.0405036  0
rs9664049    10   104000307  T   C   0.6636273  -0.4611097  0.6447199  0.9857300  0
rs149691625  10   104000837  T   C   0.0046659  -2.7223649  0.0064817  0.0791713  0
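
To see how the two functions differ on the same variants, the results can be joined on rsid. The sketch below uses the four SNPs that appear in both result tables shown in this vignette, rebuilt by hand so the example runs without the reference panel. The object names are illustrative.

```r
# Rows copied from the dist() and distmix() result tables in this vignette.
dist_rows <- data.frame(
  rsid = c("rs117589665", "rs530689457", "rs9664049", "rs149691625"),
  z    = c(3.7785313, -1.2757191, -0.4576290, -2.9077590),
  info = c(0.9498775, 0.0831094, 0.9859440, 0.0870822),
  stringsAsFactors = FALSE
)
distmix_rows <- data.frame(
  rsid = c("rs117589665", "rs530689457", "rs9664049", "rs149691625"),
  z    = c(3.7654378, -1.5946868, -0.4611097, -2.7223649),
  info = c(0.9502819, 0.1066830, 0.9857300, 0.0791713),
  stringsAsFactors = FALSE
)

# Join on rsid and compare the imputed Z-scores.
both <- merge(dist_rows, distmix_rows, by = "rsid",
              suffixes = c(".dist", ".distmix"))
both$z_diff <- both$z.distmix - both$z.dist
both[order(-abs(both$z_diff)), c("rsid", "z.dist", "z.distmix", "z_diff")]
```

In these four rows, the two variants with info above 0.9 differ by less than 0.02 in Z between the two functions, while the two low-info variants differ by roughly 0.19 and 0.32.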

Manhattan Plot

# Keep variants with imputation information above 0.6
res.mix.info <- res.mix %>% filter(info > 0.6)

# Label variant type for the legend (0 = imputed, 1 = measured)
res.mix.info$type <- factor(res.mix.info$type,
                            levels = c(0, 1),
                            labels = c("Imputed", "Measured"))

# Genome-wide significance threshold
gwas.sig <- 5e-8

ggplot(res.mix.info, aes(x = bp, y = -log10(pval),
                         color = info, shape = type)) +
  geom_point(alpha = 0.8) +
  # `linewidth` replaces `size` for lines as of ggplot2 3.4.0
  geom_hline(yintercept = -log10(gwas.sig),
             linetype = "dashed",
             color = "black",
             linewidth = 1.5) +
  scale_color_gradient(low = "blue", high = "red") +
  labs(x = "Base Pair Position",
       y = "-Log10(p-value)",
       color = "Info",
       shape = "Type") +
  theme_minimal()

References

  • Lee et al. DIST: direct imputation of summary statistics for unmeasured SNPs. Bioinformatics. 2013 Nov 15;29(22):2925-7. doi: 10.1093/bioinformatics/btt500.
  • Lee et al. DISTMIX: direct imputation of summary statistics for unmeasured SNPs from mixed ethnicity cohorts. Bioinformatics. doi: 10.1093/bioinformatics/btv348.