
Introduction

As genome-wide association studies (GWAS) expand in scope and diversity, precisely quantifying a study’s ancestral makeup has become increasingly important. The afmix() function of the GAUSS package estimates ancestry proportions in multi-ethnic GWAS using only summary statistics. This vignette is a step-by-step guide to using afmix().

Overview of afmix()

The afmix() function requires four specific arguments:

  • input_file: File name of the allele frequency data
  • reference_index_file: File name of reference panel index data
  • reference_data_file: File name of reference panel data
  • reference_pop_desc_file: File name of reference panel population description data

afmix() returns a data.frame with three columns:

  • sup.pop: super population name
  • pop: population name
  • wgt: population proportion
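The documented return structure can be checked with a few lines of base R. This is a minimal sketch: the helper name check_afmix_output() and the toy data frame are assumptions for illustration and are not part of GAUSS. The three toy rows are copied from the results table later in this vignette, and the check that wgt lies between 0 and 1 is an assumption based on wgt being described as a proportion.

```r
# Hypothetical helper (not part of GAUSS): checks that a result has the
# three columns documented for afmix().
check_afmix_output <- function(df) {
  expected <- c("sup.pop", "pop", "wgt")
  stopifnot(is.data.frame(df))
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Assumption: proportions are numeric and lie between 0 and 1
  stopifnot(is.numeric(df$wgt), all(df$wgt >= 0 & df$wgt <= 1))
  invisible(TRUE)
}

# Toy stand-in for an afmix() result
toy_result <- data.frame(
  sup.pop = c("AFR", "EUR", "SAS"),
  pop     = c("ASW", "CEU", "PJL"),
  wgt     = c(0.036, 0.165, 0.016)
)
check_ok <- check_afmix_output(toy_result)
```

With a real result, the same call would be check_afmix_output(wgt.df) once afmix() has run.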

Preparing the Input Data

Allele Frequency Data

The allele frequency data file must be a space-delimited text file with six columns: rsid (SNP ID), chr (chromosome number), bp (base pair position), a1 (reference allele), a2 (alternative allele), and af1 (reference allele frequency). In this vignette, we use chromosome 22 allele frequency data from the Psychiatric Genomics Consortium’s Phase 2 Schizophrenia (PGC SCZ2) GWAS.

Below, we specify the path to the allele frequency data file.

library(data.table)  # provides fread()

# Path to the input file
input_file <- "../data/PGC2_Chr22_ilmn1M_AF1.txt"

# Input file should include six columns (rsid, chr, bp, a1, a2, and af1)
input.data <- fread(input_file, header = TRUE)
head(input.data)
#>         rsid chr       bp a1 a2       af1
#> 1: rs1000427  22 36890105  A  G 0.1159800
#> 2: rs1000470  22 24026845  A  C 0.1240050
#> 3: rs1000539  22 20202729  A  G 0.2660650
#> 4:   rs10009  22 22051709  A  G 0.6319500
#> 5: rs1001022  22 26403488  T  C 0.9570250
#> 6: rs1001213  22 34131736  A  G 0.0667805
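Before passing a file to afmix(), it can help to confirm that it has the six expected columns and that the frequencies are valid. The following is a minimal base R sketch, not a GAUSS feature: it reads three rows copied from the head() output above from an in-memory string so that it runs without the data file, and the specific checks are assumptions about what a well-formed input looks like.

```r
# Three rows copied from the head() output above, held in memory so the
# sketch runs without the data file.
af_text <- "rsid chr bp a1 a2 af1
rs1000427 22 36890105 A G 0.1159800
rs1000470 22 24026845 A C 0.1240050
rs1001022 22 26403488 T C 0.9570250"

# colClasses keeps the allele columns as character; without it, a column
# containing only "T" would be read as logical TRUE by read.table().
af <- read.table(
  text = af_text, header = TRUE,
  colClasses = c("character", "integer", "integer",
                 "character", "character", "numeric")
)

# Check that all six documented columns are present
required <- c("rsid", "chr", "bp", "a1", "a2", "af1")
missing_cols <- setdiff(required, names(af))

# Flag frequencies that are missing or outside [0, 1]
bad_af <- is.na(af$af1) | af$af1 < 0 | af$af1 > 1

stopifnot(length(missing_cols) == 0, !any(bad_af))
```

The same checks apply unchanged to input.data, since a data.table supports names() and $ column access.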

Reference Panel Data Files

Next, we assign the paths to the reference panel data files. In this example, we use the 33KG reference panel.

# Paths to the reference files (replace these with your actual paths)
reference_index_file <- "../ref/Human/33KG/33kg_index.gz"
reference_data_file <- "../ref/Human/33KG/33kg_geno.gz"
reference_pop_desc_file <- "../ref/Human/33KG/33kg_pop_desc.txt"
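Because the paths above must be replaced with local ones, a quick existence check can catch typos before a run. This is a minimal sketch under stated assumptions: missing_files() is a hypothetical helper, not part of GAUSS, and the demonstration uses temporary files rather than the real reference panel.

```r
# Hypothetical helper (not part of GAUSS): returns the paths that do not exist.
missing_files <- function(paths) {
  paths[!file.exists(paths)]
}

# Demonstration with one temporary file that exists and one path that does not
existing_path <- tempfile(fileext = ".txt")
writeLines("placeholder", existing_path)
absent_path <- file.path(tempdir(), "no_such_reference_file.gz")

not_found <- missing_files(c(existing_path, absent_path))
```

In this vignette's setting, the vector passed to missing_files() would hold input_file and the three reference paths, and a non-empty result would indicate which path to fix.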

Running afmix()

With the arguments and data files in place, we can run afmix() to estimate the ancestry proportions of the PGC SCZ2 GWAS.

library(GAUSS)  # provides afmix()

# Estimate ancestry proportions from the allele frequency data
wgt.df <- afmix(input_file = input_file,
                reference_index_file = reference_index_file,
                reference_data_file = reference_data_file,
                reference_pop_desc_file = reference_pop_desc_file)

Results: Estimated Ancestry Proportions

Here, we display the estimated ancestry proportions in an HTML table.

library(dplyr)  # provides %>%, group_by(), and summarise()
library(knitr)  # provides kable()
wgt.df %>% kable("html")
sup.pop pop wgt
AFR ACB 0.006
AFR ASW 0.036
SAS BEB 0.005
ASN CCE 0.008
ASN CCS 0.004
ASN CDX 0.018
EUR CEU 0.165
AMR CLM 0.025
ASN CNE 0.003
ASN CSE 0.012
EUR FIN 0.138
EUR GBR 0.165
SAS GIH 0.006
EUR IBS 0.099
ASN JPT 0.011
ASN KHV 0.017
AMR MXL 0.030
EUR ORK 0.166
SAS PJL 0.016
AMR PUR 0.045
EUR TSI 0.086
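To see which populations contribute most, the result can be sorted by weight. The sketch below is illustrative only: it rebuilds the European rows from the table above as a stand-in data frame (the name eur is an assumption) and sorts them with base R.

```r
# European rows copied from the table above
eur <- data.frame(
  sup.pop = "EUR",
  pop     = c("CEU", "FIN", "GBR", "IBS", "ORK", "TSI"),
  wgt     = c(0.165, 0.138, 0.165, 0.099, 0.166, 0.086),
  stringsAsFactors = FALSE
)

# Sort rows from the largest to the smallest weight
eur_sorted <- eur[order(eur$wgt, decreasing = TRUE), ]

# Population with the largest estimated proportion
top_pop <- eur_sorted$pop[1]
```

The same indexing applies to the full result, for example wgt.df[order(wgt.df$wgt, decreasing = TRUE), ].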

Summarizing Ancestry Proportions by Super Population

Here, we calculate the total proportion for each super population and present the results in a table.

wgt.df %>%
  group_by(sup.pop) %>%
  summarise(wgt = sum(wgt), .groups = "drop") %>%
  kable("html")
sup.pop wgt
AFR 0.042
AMR 0.100
ASN 0.073
EUR 0.819
SAS 0.027
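The same summary can be produced without dplyr. The following base R sketch uses aggregate() on a stand-in data frame, wgt_tbl (an assumed name), whose 21 rows are copied from the population-level table above. Because those values are rounded to three decimals, the sums match the displayed table only up to that rounding.

```r
# All 21 rows copied from the population-level table above
wgt_tbl <- data.frame(
  sup.pop = c("AFR", "AFR", "SAS", "ASN", "ASN", "ASN", "EUR",
              "AMR", "ASN", "ASN", "EUR", "EUR", "SAS", "EUR",
              "ASN", "ASN", "AMR", "EUR", "SAS", "AMR", "EUR"),
  pop     = c("ACB", "ASW", "BEB", "CCE", "CCS", "CDX", "CEU",
              "CLM", "CNE", "CSE", "FIN", "GBR", "GIH", "IBS",
              "JPT", "KHV", "MXL", "ORK", "PJL", "PUR", "TSI"),
  wgt     = c(0.006, 0.036, 0.005, 0.008, 0.004, 0.018, 0.165,
              0.025, 0.003, 0.012, 0.138, 0.165, 0.006, 0.099,
              0.011, 0.017, 0.030, 0.166, 0.016, 0.045, 0.086),
  stringsAsFactors = FALSE
)

# Sum the population weights within each super population
by_super <- aggregate(wgt ~ sup.pop, data = wgt_tbl, FUN = sum)
```

aggregate() returns one row per super population, ordered alphabetically by sup.pop, which is the same layout as the table above.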
