Question

Verifying Next Gen Seq Source

12.0 years ago

Chris Simokat ▴ 10

An immunologist I have started a project with has some next-gen sequencing data from a previous project, but he is not sure that the data the lab sent back actually comes from the samples he sent them. He is hoping that I can verify their results by analyzing their data. The organism has a reference genome in NCBI. From my understanding of FASTA and BLAST (largely based on descriptions in texts like Introduction to Mathematical Methods in Bioinformatics), it seems that FASTA would be able to compare the contents of the FASTQ files from the lab against the reference FNA or FAA files from NCBI. I have created a local install of Galaxy, and I could use Bioconductor, but I am wondering whether I am going about this the right way. Am I, or is there another, perhaps better, way to do this verification? Any guidance or references would be greatly appreciated. Thank you.

comparison fasta bioconductor galaxy reference • 2.0k views

ADD COMMENT • link updated 12.0 years ago by seidel 11k • written 12.0 years ago by Chris Simokat ▴ 10

score 2 · Answer 1 · 2012-04-15

12.0 years ago

Ryan Thompson ★ 3.6k

If your data is supposed to be human, then you can try mapping to the human reference using a short-read mapping program like bwa or bowtie. If only a low fraction of the reads map, then you might have a sample from another species. But if the sample got mixed up with another human sample, you won't be able to tell unless you already have genotyping information for specific loci that you can verify.
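
As a rough illustration of the "fraction of reads that map" check, here is a minimal Python sketch. It assumes the mapper's output is available as SAM text and relies only on the SAM FLAG field (bit 0x4 means unmapped; bits 0x100 and 0x800 mark secondary and supplementary alignments). The function name `mapped_fraction` and the example records are made up for illustration and are not part of any tool.

```python
def mapped_fraction(sam_lines):
    """Return the fraction of reads (primary records only) that are mapped."""
    total = 0
    mapped = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue  # skip secondary and supplementary alignments
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical SAM records: two mapped reads, one unmapped read,
# and one secondary alignment that is not counted.
example_sam = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "read1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n",
    "read3\t16\tchr1\t50\t60\t4M\t*\t0\t0\tGGCC\tIIII\n",
    "read3\t256\tchr1\t90\t0\t4M\t*\t0\t0\tGGCC\tIIII\n",
]

print(f"mapped fraction: {mapped_fraction(example_sam):.2f}")
```

On a real file you would pass an open file handle instead of the example list. What counts as a "low" fraction depends on the organism, the library type, and the mapper settings, so treat any cutoff as a judgment call.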

ADD COMMENT • link 12.0 years ago by Ryan Thompson ★ 3.6k

I know you answered first, and while the data wasn't human, what you suggested was helpful too. So thank you.

ADD REPLY • link 12.0 years ago by Chris Simokat ▴ 10

score 0 · Answer 2 · 2012-04-16

This is a common problem faced by anyone working with this kind of data - verifying that fastq files contain data for the sample you think they do. Tubes can get swapped (or plates can get rotated) at many places during the generation of the data, and files can get moved around or renamed by mistake. The key is in your statement, "verify their results by analyzing their data." What are their results? And does the content of your fastq files (i.e. the data) support those results?

You didn't mention what kind of experiment it is, so it's hard to comment on how the files can be checked specifically, but in general you would check any predictable property of the files. Ryan mentioned the first and most obvious thing to check - do the majority of reads in your files map to the intended organism? This can be checked with Galaxy.

The next thing to check depends on the type of experiment your data represents. If it is genotyping data, then check any locus that is known for your sample. If this sample is important to your experimental system, there are probably several specific loci that you can verify: a specific SNP, or some other mutation in a gene which gives these cells a particular phenotype (e.g. constitutively active p53). Perhaps the cells have been immortalized, or transfected with a construct, in which case you can check for reads that would be produced by the construct (e.g. over-expression of p53, RNAi knockdown of p53, etc.).

If the experiment is an RNA Seq experiment, you can check many of these same things to verify the genotype of the cells. If you have aligned reads, these things can be checked with a genome browser (Trackster in Galaxy, perhaps, or IGV, which would be very easy for this), or for specific loci you can easily check them with samtools once you have bam files. However, with RNA Seq data you may have to look at the gene expression pattern to verify the identity of the fastq files. For instance, are there expressed marker genes that you know should be present in this data set? Does the pattern match the expected cell type or experimental condition of interest?

If you have ChIP Seq data, all the genotype checks still apply, but you will also want to verify all the predictable experimental characteristics: are there peaks in your data? Do they look like you expect, and appear in the places you expect them to (e.g. a transcription factor, a histone, and RNA Pol II all have different but predictable patterns)?
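
For the "check for reads that would be produced by the construct" idea, a quick look at the raw reads can be sketched in Python. This is only an illustration under stated assumptions: the FASTQ uses plain four-line records (no wrapped sequence lines), and you know a short sequence unique to the construct or allele. The function names, the reads, and the marker sequence below are all hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N)."""
    table = str.maketrans("ACGTN", "TGCAN")
    return seq.upper().translate(table)[::-1]


def count_reads_with_marker(fastq_lines, marker):
    """Count reads containing the marker sequence on either strand.

    Assumes four-line FASTQ records, so the sequence is the 2nd line
    of every record. Returns (reads_with_marker, total_reads).
    """
    marker = marker.upper()
    marker_rc = reverse_complement(marker)
    hits = 0
    total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 != 1:
            continue  # only look at sequence lines
        seq = line.strip().upper()
        total += 1
        if marker in seq or marker_rc in seq:
            hits += 1
    return hits, total


# Hypothetical reads: r1 carries the marker, r3 carries its reverse complement.
example_fastq = [
    "@r1", "TTACGGATTCAGGA", "+", "IIIIIIIIIIIIII",
    "@r2", "CCCCCCCCCCCCCC", "+", "IIIIIIIIIIIIII",
    "@r3", "TCCTGAATCCGTAA", "+", "IIIIIIIIIIIIII",
]
hits, total = count_reads_with_marker(example_fastq, "ACGGATTCAG")
print(f"{hits} of {total} reads contain the marker")
```

Exact matching like this misses reads with sequencing errors in the marker, so it is a sanity check rather than a genotype call. For anything subtle, looking at the aligned reads in IGV or with samtools is the more reliable route.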

This problem is actually the first step that many of us take when starting an analysis - make a table of predictable properties of the data (e.g. genotype, marker gene, etc.), and verify that the fastq files map correctly to those properties. I've analyzed several data sets where tubes were swapped, or where I found novel mutations in the genes of interest (oops, we thought the mutation at this position was responsible for our phenotype - we didn't know about this other mutation), or where I found that the mutant wasn't the sequence people thought it was. Check any predictable property. Got a priori? Verify.
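
The "table of predictable properties" can be as simple as a dictionary per sample that gets compared against what the files actually show. Below is a minimal Python sketch of that bookkeeping. The sample names, property names, and values are invented for illustration, and how each observed value is obtained (mapping, a genome browser, expression levels) is left out.

```python
def check_expectations(expected, observed):
    """Compare expected and observed properties for each sample.

    Returns {sample: [(property, expected_value, observed_value), ...]}
    listing only the properties that disagree (or were never observed).
    """
    report = {}
    for sample, props in expected.items():
        seen = observed.get(sample, {})
        report[sample] = [
            (name, want, seen.get(name))
            for name, want in props.items()
            if seen.get(name) != want
        ]
    return report


# Hypothetical expectations written down before looking at the data.
expected = {
    "sampleA": {"organism": "mouse", "marker_gene_expressed": True, "construct_reads": True},
    "sampleB": {"organism": "mouse", "marker_gene_expressed": False, "construct_reads": False},
}
# Hypothetical observations from the fastq files; here A and B look swapped.
observed = {
    "sampleA": {"organism": "mouse", "marker_gene_expressed": False, "construct_reads": False},
    "sampleB": {"organism": "mouse", "marker_gene_expressed": True, "construct_reads": True},
}

for sample, mismatches in check_expectations(expected, observed).items():
    status = "OK" if not mismatches else f"MISMATCH: {mismatches}"
    print(sample, status)
```

When two samples each fail on exactly the properties the other one passes, as in this made-up case, a tube or file swap is the first thing to suspect.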