Tutorial: Produce PCA bi-plot for 1000 Genomes Phase III - Version 2
22
Entering edit mode
7 months ago
Kevin Blighe 43k
Republic of Ireland

Note1 - Previous version: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format (old)

Note2 - this data is for hg19 / GRCh37


The tutorial has been updated based on the 1000 Genomes Phase III imputed genotypes. The original tutorial was performed on non-imputed data held at the University of Washington, which is no longer accessible.

Other changes:

  • tutorial now entirely streamlined - all commands, including in R, are now included
  • duplicate variants are now removed with BCFtools, not PLINK (previous Step 6 removed)
  • now only performs PCA (originally, MDS was also performed but never used)
  • no longer using chrX variants (only autosomal variants)
  • new Step 3, indicating how to download the 1000 Genomes GRCh37 reference build

Program requirements:

  • plink > v1.9
  • BCFtools (tested on v 1.3)

Disk space requirements:

  • downloaded data (VCF.gz and tab-indices), ~ 15.5 GB
  • converted BCF files and their indices, ~14 GB
  • binary PLINK files, ~53 GB
  • pruned PLINK binary files, ~ <1 Gb

1, Download the files as VCF.gz (and tab-indices)

prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr" ;

suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz" ;

for chr in {1..22}; do
    wget "${prefix}""${chr}""${suffix}" "${prefix}""${chr}""${suffix}".tbi ;
done

2, Download 1000 Genomes PED file

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;

3, Download the GRCh37 / hg19 reference genome

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.fai ;

gunzip http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz ;

4, Convert the 1000 Genomes files to BCF

  • Ensure that multi-allelic calls are split and that indels are left-aligned compared to reference genome (1st pipe)
  • Sets the ID field to a unique value: CHROM:POS:REF:ALT (2nd pipe)
  • Removes duplicates (3rd pipe)

-I +'%CHROM:%POS:%REF:%ALT' means that unset IDs will be set to CHROM:POS:REF:ALT

-x ID -I +'%CHROM:%POS:%REF:%ALT' first erases the current ID and then sets it to CHROM:POS:REF:ALT

for chr in {1..22}; do
    bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \
    ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

    bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |

    bcftools norm -Ob --rm-dup both \
    > ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

    bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;
done

5, Convert the BCF files to PLINK format

for chr in {1..22}; do
    plink --noweb --bcf ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf \
    --keep-allele-order --vcf-idspace-to _ --const-fid --allow-extra-chr 0 --split-x b37 no-fail --make-bed \
    --out ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

6, Exclude variants not on the coding strand

NB - This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

7, Prune variants from each chromosome

--maf 0.10, only retain SNPs with MAF greater than 10%
--indep [window size] [step size/variant count)] [Variance inflation factor (VIF) threshold]

e.g. indep 50 5 1.5, Generates a list of markers in approx. linkage equilibrium - takes 50 SNPs at a time and then shifts by 5 for the window. VIF (1/(1-r^2)) is the cut-off for linkage disequilibrium

mkdir Pruned ;

for chr in {1..22}; do
    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --maf 0.10 --indep 50 5 1.5 \
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;

    plink --noweb --bfile ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes \
    --extract Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.prune.in --make-bed \
    --out Pruned/ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes ;
done

8, Get a list of all PLINK files

find . -name "*.bim" | grep -e "Pruned" > ForMerge.list ;

sed -i 's/.bim//g' ForMerge.list ;

9, Merge all projects into a single PLINK file

plink --merge-list ForMerge.list --out Merge ;

10, Perform PCA

plink --bfile Merge --pca

11, Generate plots in R

R

options(scipen=100, digits=3)

# read in the eigenvectors, produced in PLINK
eigenvec <- data.frame(read.table("plink.eigenvec", header=FALSE, skip=0, sep=" "))
rownames(eigenvec) <- eigenvec[,2]
eigenvec <- eigenvec[,3:ncol(eigenvec)]
colnames(eigenvec) <- paste("Principal Component ", c(1:20), sep="")

# read in the PED data
PED <- data.frame(read.table("20130606_g1k.ped", header=TRUE, skip=0, sep="\t"))
PED <- PED[which(PED$Individual.ID %in% rownames(eigenvec)), ]
PED <- PED[match(rownames(eigenvec), PED$Individual.ID),]
all(PED$Individual.ID == rownames(eigenvec)) == TRUE
[1] TRUE

# set colours
#BiocManager::install("RColorBrewer")
require("RColorBrewer")

# from: http://www.internationalgenome.org/category/population/
PED$Population <- factor(PED$Population, levels=c(
        "ACB","ASW","ESN","GWD","LWK","MSL","YRI",
        "CLM","MXL","PEL","PUR",
        "CDX","CHB","CHS","JPT","KHV",
        "CEU","FIN","GBR","IBS","TSI",
        "BEB","GIH","ITU","PJL","STU"))

col <- colorRampPalette(c(
        "yellow","yellow","yellow","yellow","yellow","yellow","yellow",
        "forestgreen","forestgreen","forestgreen","forestgreen",
        "grey","grey","grey","grey","grey",
        "royalblue","royalblue","royalblue","royalblue","royalblue",
        "black","black","black","black","black"))(length(unique(PED$Population)))[factor(PED$Population)]

# generate PCA bi-plots
project.pca <- eigenvec
summary(project.pca)


par(mar=c(5,5,5,5), cex=2.0, cex.main=7, cex.axis=2.75, cex.lab=2.75, mfrow=c(1,2))

plot(project.pca[,1], project.pca[,2], type="n", main="A", adj=0.5, xlab="First component", ylab="Second component", font=2, font.lab=2)
points(project.pca[,1], project.pca[,2], col=col, pch=20, cex=2.25)
legend("bottomright", bty="n", cex=3.0, title="", c("African","Hispanic","East-Asian","Caucasian","South Asian"), fill=c("yellow","forestgreen","grey","royalblue","black"))

plot(project.pca[,1], project.pca[,3], type="n", main="B", adj=0.5, xlab="First component", ylab="Third component", font=2, font.lab=2)
points(project.pca[,1], project.pca[,3], col=col, pch=20, cex=2.25)

biplot

Kevin

Entering edit mode
0

Hi Kelvin, thank you very much for your tutorial. It helps me a lot to plot the PCA-bi plot of 1kg Phrase III samples.

My question is, if I would like to plot my own sample's PCA with 1kg sample's PCA (My samples were genotyped by SNP array. There were around 700k SNPs. I want to check whether my samples came from certain ethical group), shall I directly combine the eigenvec calculated from my sample and eigenvec got from your step 10 and then plot? Can the two eigen vectors comparable?

I ask this question because in my understanding, PC is the linear combination of independent variables. In my sample, PCs are the linear combination of the pruned genotyped SNPs whereas in 1kg samples, their PCs are the linear combination of other pruned (and much more) SNPs. I doubt whether I can directly compare the PCs. If I can't compare my sample's PCs with 1kg's in this way, how can I compare the ethical group of my sample to the reference 1kg samples?

Thank you very much for your possible reply in advance.

ADD REPLYlink 11 months ago
Sunshine n Rain
• 20
Entering edit mode
0

Hey, my apologies for I seem to have missed this. It is better to combine them within PLINK and then perform PCA on the combined dataset. I have done this many times and it usually works, in the sense that sample ethnicities line up with the expected group from 1000 Genomes.

ADD REPLYlink 10 months ago
Kevin Blighe
43k
Entering edit mode
0

Hi Kevin,

Does the following command do a liftover of 1000 genomes to Hg19. I need to do QC on 1000 genomes data before using for LD clumping and as I haven't done this before, I'm a bit unsure. I followed your steps until this point and then proceeded to convert to plink, removed SNPs based on call rate and HWE and will then proceed to use them for LD clumping

for chr in {1..22}; do bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta \ ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | \

bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' |

bcftools norm -Ob --rm-dup both \
> ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

bcftools index ALL.chr"${chr}".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bcf ;

done

Thanks

ADD REPLYlink 10 months ago
rkl
• 0
Entering edit mode
0

Hey, I am not sure what you mean by 'liftover' in this context (?). If you want to do LD clumping, then you just need to complete steps 1-5. This will take ~1 day, though. The imputed 1000 Genomes data is 'quite' dense.

ADD REPLYlink 10 months ago
Kevin Blighe
43k
Entering edit mode
0

The commands that you have pasted into your comment do the following:

  • bcftools norm -m-any --check-ref w -f human_g1k_v37.fasta - checks that the REF variants in the 1000 Genomes VCF files corresponds to the base in the reference genome, human_g1k_v37.fasta.
  • bcftools annotate -x ID -I +'%CHROM:%POS:%REF:%ALT' - annotates the ID field in the 1000 Genomes VCF files to be, for example, chr1:12345645:A:T
  • bcftools norm -Ob --rm-dup both - the 1000 Genomes data contains duplicate variants, which have to be removed for PLINK
ADD REPLYlink 10 months ago
Kevin Blighe
43k
Entering edit mode
0

Hi Kelvin, thank you very much for your tutorial. It really helped me a lot.

I am trying to check that the 89 individuals I am working with are of European origin by comparing them with 1000 Genomes dataset. However, the data I have for my samples was obtained from targeted sequencing of 1,293 regions. My main concern is whether I have to extract only SNPs falling inside my targeted regions from the 1000 Genomes dataset before merging with my 89 samples or, whether I should use all the information from 1000 Genomes.

Thank you for your possible reply in advance.

ADD REPLYlink 9 months ago
mpinsach
• 0
Entering edit mode
0

Hey, I previously achieved this by building a predictive model from the 1000 Genomes Data and then applying it to my own data, which was also targeted sequencing. See here: A: How to predict individual ethnicity information by using hapmap data

The other way to do it is filter the datasets so that they have the same starting list of variants, and then proceed with the tutorial from part 7. For this to work, it would help if your variants are all labeled by rs ID or as chr:bp:ref:var

ADD REPLYlink 9 months ago
Kevin Blighe
43k
Entering edit mode
0

Thanks so much for this tutorial, Kevin !!

ADD REPLYlink 7 months ago
Mr Locuace
• 90
Entering edit mode
0

¬°De nada!, Mr Locuace

ADD REPLYlink 7 months ago
Kevin Blighe
43k
Entering edit mode
0

Hey Kelvin,

How to plot sub-population ("ACB","ASW","ESN","GWD","LWK","MSL","YRI", "CLM","MXL","PEL","PUR", "CDX","CHB","CHS","JPT","KHV", "CEU","FIN","GBR","IBS","TSI", "BEB","GIH","ITU","PJL","STU") in pca plot instead of "African","Hispanic","East-Asian","Caucasian","South Asian"?

ADD REPLYlink 4 months ago
Abdul Rafay Khan
• 1000
Entering edit mode
1

Hey, you will simply have to modify the input colour vector. You will have to take great care to ensure that the colours line up to the data that is being plotted.

ADD REPLYlink 4 months ago
Kevin Blighe
43k

Login before adding your answer.

Powered by the version 1.5