Question

Troubles comparing SNP called on Illumina reference human genome

1

Entering edit mode

7.2 years ago

wpierrick ▴ 90

Hello everyone,

I run into weird results when calling a list of variants from a gold standard genome NA12878. My goal is to evaluate how well perform two different calling tools, so I aim to produce a vcf containing a list of variants from a known genome and then compare it with the high confidence list produced by Illumina in Platinium Genomes (https://www.illumina.com/platinumgenomes.html) to find the true positive, false positive etc... It is quite similar with the work done in those two papers: http://www.nature.com/articles/srep17875 http://www.nature.com/nbt/journal/v32/n3/full/nbt.2835.html

To do so, I have downloaded the sequences located here : ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/ So from what I understand, it corresponds to one of the sequencing runs for NA12878.

I then have merged the fastq in two files (paired end), trimmed the first 10 bp and aligned them using BowTie and the human genome 38 reference on Ensembl. ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz The output was a bam and a bai. So far so good. I hope.

Now I want to call variants so I compare them with the variants list provided by Illumina for NA12878 at this adress: ftp://platgene_ro@ussd-ftp.illumina.com/2016-1.0/hg38/small_variants/NA12878/NA12878.vcf.gz

After importing the BAM/BAI, I use the annotation file located on Ensmbl : ftp://ftp.ensembl.org/pub/release-87/regulation/homo_sapiens/homo_sapiens.GRCh38.motiffeatures.20161111.gff.gz and run the calling software.

Two problem arise:

The Variant Effect Annotation step tells me the annotation contains no protein gene records. Alright, I can do without this information as long as I have a list of SNPs and indels as an output but that's weird.
The actual SNP/indel output has identified something like 700 SNPs with default parameters. With parameters not stringent at all, I can find up to 3000 SNPs but nothing even remotely close to the millions found by Illumina.

Any comparison at this point is quite irrelevant. I probably have a problem with the various files I input, but can't figure out where is the problem coming from. Any thoughts on this question ? Is my "roadmap" to produce the list of SNPs and compare it to Illumina's one correct?

Thanks a lot for your input!

SNP sequencing • 2.8k views

ADD COMMENT • link 7.2 years ago by wpierrick ▴ 90

0

Entering edit mode

Thanks for all those links; I found them helpful! Although, I've actually run into a lot of trouble determining what reference NIST used for aligning its reads. Their reference links are dead, and the description is extremely vague, so their VCF is useless to me for now. Instead I'm using Illumina's VCF. Note that you can't trivially wget Illumina's files - they require a blank password.

If anyone knows where to download the reference NIST is using for alignment, I'd be happy to hear it... it is NOT this:

ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz

I don't know what that is, but please don't use it for anything. It contains 48 Gbp of sequence, 94% of which is N. For reference, the actual human genome is around 3 Gbp. On the upside, it does compress very well.

On the other hand, Illumina is using HG19 which is very easy to acquire. For some reason they decided to omit chromosome MT, so of course you should not follow their example exactly because they don't seem to follow best practices.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Hello Brian, Thanks a lot for you reply!

You are completely right for the HG38 file! After checking more in details, it contains 44807691997 N characters out of 48669025517 total characters (!). So if I understand, I should:

Re-run the pipeline starting from the alignment part with Hg19 as a reference (is this version ftp://ftp.ensembl.org/pub/release-75//fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.toplevel.fa.gz OK? The size seems about right)
Call the variants using the same reference genome and ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz as the annotation file
Then use the VCF of the variants found by Illumina for HG19 (at this adresse ftp://platgene_ro@ussd-ftp.illumina.com/2016-1.0/hg19/small_variants/NA12878/ ) to compare it with my variant list with the MT chromosome in mind.

Is this alright? It's really hard not to get confused by the different builds/reference, so I prefer to be sure. Another question, do you know where I can find the variants called for NA12878 by NIST ? I can't find the corresponding vcf anywhere. If the NIST list exists, has anyone compared it with the list produced by Illumina (always for NA12878) ?

Cheers,

EDIT: oh, and on a side note, I found this program to compare VCF, seems very useful : https://github.com/Illumina/hap.py

ADD REPLY • link 7.2 years ago by wpierrick ▴ 90

1

Entering edit mode

From Ensembl description of your file (README file in the directory of GRCh38 link you provided):

TOPLEVEL --------- These files contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromsomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

What you need is a set of chromosomes with repeats and low complexity regions masked for alignment. If you want to use alternative haplotypes (ie patches) here is a good tutorial http://gatkforums.broadinstitute.org/gatk/discussion/8017/how-to-map-reads-to-a-reference-with-alternate-contigs-like-grch38

ADD REPLY • link 7.2 years ago by Petr Ponomarenko ★ 2.8k

2

Entering edit mode

I disagree. Masking prior to alignment is a bad idea, if you care about reads mapping correctly. It causes false positives, false negatives, and incorrect mapq calculation.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

1

Entering edit mode

I just corresponded with NIST, and they now have a new location for the reference:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz

Personally, I recommend mapping to chr1-22 plus X, Y, and MT, and discarding alt/unplaced sequences, which are low-quality and often nonhuman. Whoever is responsible for curating the human reference genome is not doing a good job.

As for your question - I'd stay away from ensembl as they seem to be making a corrupt human reference publicly available, unless I'm misunderstanding something. Illumina's VCFs are very good, but have Illumina biases; NIST has VCFs that list multiple different platforms (on a per-variant basis it lists which platforms support that variant), and are thus probably better.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Thank you both Petr and Brian for your help.

It baffles me that NIST / Ensembl are not curating human reference genome properly. I can't imagine the time lost by research labs across the world because of that issue. Anyway, I will proceed with the new reference you provided. Thanks a lot for the link!

ADD REPLY • link 7.2 years ago by wpierrick ▴ 90

score 1 · Answer 1 · 2017-03-01

1

Entering edit mode

7.2 years ago

Petr Ponomarenko ★ 2.8k

In this phrase "merged the fastq in two files (paired end), trimmed the first 10 bp and aligned them using TopHat and the human genome 38 reference on Ensembl" do you mean that you merged forward and reverse reads from each pair into a single longer read and only then aligned to the reference? If so, the problem is coming from that step, because paired-end reads have a distribution of insert sizes (distance between forward and reverse read) that is coming from DNA shearing. When you align with BWA, Bowtie or other aligners in paired end mode the program estimates that insert length distribution and can align reads better. When you use TopHat on a merged paired end reads it does not estimate insert size and allows breaks in alignment, because it is designed for RNA-seq data where we have read pairs spanning introns that can be kilobases long.

ADD COMMENT • link 7.2 years ago by Petr Ponomarenko ★ 2.8k

1

Entering edit mode

if you meant that forward and reverse read files were concatenated the problem with statistics is present still.

ADD REPLY • link 7.2 years ago by Petr Ponomarenko ★ 2.8k

0

Entering edit mode

Hello Petr. Thank you for your very complete reply. First I aligned the reads using BowTie as it's DNA sequencing data. I have edited my original message, my bad, sorry. About the concatenation, I have merged all the forward strands together and all the reverse ones together. My guess is that I use wrong reference genome/gff or something like that, but can't figure out what exactly. I hope this helps. Cheers

ADD REPLY • link 7.2 years ago by wpierrick ▴ 90