Skip to contents

Introduction

Transcriptome-Wide Association Studies (TWAS) serve as a powerful framework for identifying the functional relationships between genetic variations and complex traits. By integrating results from Genome-Wide Association Studies (GWAS) with rich functional annotations, TWAS provides a more comprehensive view of the genetic architecture of traits and diseases. GAUSS offers a robust suite of tools to streamline and enhance TWAS analyses. Two of its specialized functions, jepeg() and jepegmix(), offer tailored solutions for homogeneous and multi-ethnic cohorts, respectively. This vignette aims to be your comprehensive guide, walking you through the hands-on application of jepeg() and jepegmix() in conducting TWAS.

Load necessary packages

Preparing the Input Data

Both jepeg() and jepegmix() functions necessitate three core input datasets:

  • Association Z-score Data
  • SNP Annotation Data
  • Reference Panel Data

Association Z-score Data

This data should be in the form of a space-delimited text file including six columns with the following names:

  • rsid: SNP ID
  • chr: Chromosome number
  • bp: Base pair position
  • a1: Reference allele
  • a2: Alternative allele
  • z: Association Z-score

For this example, we will use the Psychiatric Genomic Consortium’s Phase 2 Schizophrenia (PGC SCZ2) GWAS dataset.

# Path to the input file
input_file <- "../data/PGC2_Chr22_ilmn1M_Z.txt"

# Input file should include six columns (rsid, chr, bp, a1, a2, and z)
input.data <- fread(input_file, header = TRUE)
head(input.data)
#>         rsid chr       bp a1 a2          z
#> 1: rs1000427  22 36890105  A  G -1.4969741
#> 2: rs1000470  22 24026845  A  C  1.8407377
#> 3: rs1000539  22 20202729  A  G  2.1683772
#> 4:   rs10009  22 22051709  A  G -0.0556759
#> 5: rs1001022  22 26403488  T  C  2.9694844
#> 6: rs1001213  22 34131736  A  G -0.8957244

SNP Annotation Data

The SNP Annotation Data should be placed in an accessible directory, as shown below:

# Paths to the annotation file (replace these with your actual paths)
annotation_file <- "../data/JEPEG_SNP_Annotation.v1.0.txt"

Reference Panel Data

We will be using the 33KG reference panel. Make sure to replace the file paths below with those of your actual reference panel files.

reference_index_file <-"../ref/Human/33KG/33kg_index.gz"
reference_data_file <- "../ref/Human/33KG/33kg_geno.gz"
reference_pop_desc_file<-"../ref/Human/33KG/33kg_pop_desc.txt"

The jepeg() function

The jepeg() function enables researchers to perform TWAS by assessing the joint effect of expression quantitative trait loci (eQTLs) and other functional variants within a specific gene on phenotype. It requires an SNP annotation file that provides a linkage between SNPs and their corresponding gene-level functional characteristics. By integrating this data, jepeg() formulates gene-level test statistics, aiming to pinpoint genes whose expression levels might be modulated by genetic variations.”

Arguments

The jepeg() function requires the following arguments:

  • study_pop: Study population group
  • input_file: File name of the association Z-score data
  • annotation: File name of the SNP annotation data set
  • reference_index_file: File name of reference panel index data
  • reference_data_file: File name of reference panel data
  • reference_pop_desc_file: File name of reference panel population description data
  • af1_cutoff: Cutoff of reference allele (a1) frequency

Outputs

The jepeg() function returns a data frame with following columns:

  • geneid: Gene name.
  • chisq: Chi-square test statistic value
  • df: Degrees of freedom
  • jepeg_pval: JEPEG p-value
  • num_snp: Number of functional SNPs associated with gene
  • top_categ: Top functional category
  • top_categ_pval: Top category p-value
  • top_snp: Top functional SNP ID
  • top_snp_pval: Top functional SNP p-value

Example Usage

In this example, we will use the jepeg() function to perform a TWAS analysis utilizing summary statistics from the PGC SCZ2 GWAS dataset. Assuming our study cohort consists of participants of British descent, we will specify study_pop = "GBR". This will ensure the function employs genotype data for individuals from Britain, specifically England and Scotland, as sourced from the 33KG reference panel.

af1_cutoff = 0.001

res <- jepeg(study_pop = "GBR",
             input_file=input_file,
             annotation_file=annotation_file,
             reference_index_file = reference_index_file,
             reference_data_file = reference_data_file,
             reference_pop_desc_file = reference_pop_desc_file)

res <- res[order(res$jepeg_pval),]

jepeg() Results

head(res) %>% kable("html")
geneid chisq df jepeg_pval num_snp top_categ top_categ_pval top_snp top_snp_pval
85 DPYD 38.41841 1 0.00e+00 1 TRN 0.00e+00 rs3788568 0.0e+00
72 CXCL14 33.98061 1 0.00e+00 1 TRN 0.00e+00 rs133047 0.0e+00
93 EP300 29.29304 1 1.00e-07 1 PFS 1.00e-07 rs20551 0.0e+00
328 WBP2NL 24.71184 1 7.00e-07 2 PFS 7.00e-07 rs2301521 1.0e-07
186 NDUFA6 24.39774 1 8.00e-07 1 PFS 8.00e-07 rs1801311 2.0e-07
332 ZBED4 19.38566 1 1.07e-05 1 PFS 1.07e-05 rs910799 3.9e-06

The jepegmix() function

For studies that involve diverse ethnic backgrounds, jepegmix() extends the utility of jepeg() to accommodate the complexities introduced by ethnic heterogeneity.

Arguments

The jepegmix() function accepts most of the same arguments as jepeg(). However, it replaces the study_pop argument with pop_wgt_df:

  • pop_wgt_df: An R data frame containing population names and corresponding ancestry proportions.

Outputs

The output of jepegmix() is identical to that of jepeg().

Example Usage

Before using jepegmix(), you need to prepare ancestry proportion data. This data should be structured as a data frame containing two columns:

  • pop: Population abbreviation
  • wgt: Corresponding proportion for that population

To estimate these ancestry proportions, the afmix() function can be employed. For a step-by-step guide on this process, refer to the afmix() vignette.

Here, we load pre-generated ancestry proportion data:

# Load the ancestry proportion data
data("PGC2_SCZ_ANC_Prop") # data frame name: PGC2_SCZ_ANC_Prop
head(PGC2_SCZ_ANC_Prop)
#>   pop   wgt
#> 1 ACB 0.006
#> 2 ASW 0.036
#> 3 BEB 0.005
#> 4 CCE 0.008
#> 5 CCS 0.004
#> 6 CDX 0.018

Now, let’s proceed to use jepegmix() to conduct TWAS. Here, we assume the study cohort includes participants from diverse ethnic backgrounds.

af1_cutoff = 0.001

res <- jepegmix(pop_wgt_df = PGC2_SCZ_ANC_Prop,
                input_file=input_file,
                annotation_file=annotation_file,
                reference_index_file = reference_index_file,
                reference_data_file = reference_data_file,
                reference_pop_desc_file = reference_pop_desc_file,
                af1_cutoff = af1_cutoff)

res <- res[order(res$jepeg_pval),]

jepegmix() Results

head(res) %>% kable("html")
geneid chisq df jepeg_pval num_snp top_categ top_categ_pval top_snp top_snp_pval
92 DPYD 38.41841 1 0.00e+00 1 TRN 0.00e+00 rs3788568 0.0e+00
75 CXCL14 33.81350 1 0.00e+00 2 TRN 0.00e+00 rs133047 0.0e+00
101 EP300 29.29304 1 1.00e-07 1 PFS 1.00e-07 rs20551 0.0e+00
361 WBP2NL 24.71140 1 7.00e-07 2 PFS 7.00e-07 rs2301521 1.0e-07
206 NDUFA6 24.39774 1 8.00e-07 1 PFS 8.00e-07 rs1801311 2.0e-07
365 ZBED4 19.38566 1 1.07e-05 1 PFS 1.07e-05 rs910799 3.9e-06

References

  • Lee et al. JEPEG: a summary statistics based tool for gene-level joint testing of functional variants. Bioinformatics, Volume 31, Issue 8, April 2015, Pages 1176–1182, https://doi.org/10.1093/bioinformatics/btu816
  • Lee et al. JEPEGMIX: gene-level joint analysis of functional SNPs in cosmopolitan cohorts. Bioinformatics. 2016 Jan 15; 32(2): 295–297. doi: 10.1093/bioinformatics/btv567