Need A Public Canonical DNA Alignment Problem With Known Good Solution
3
2
11.8 years ago

I am implementing a DNA alignment tool and I want to test it for correctness and performance against the best known tools. The basic alignment problem seems to be "here is this fastq file that we just got out of the sequencing machine; what parts of it align with this known fasta database?" One canonical fasta database seems to be the recent version of the human genome: hg19.fa. I don't want to use any proprietary data; are there public raw sequencing results that I can use somewhere to test against it? What is a good measure of a quality match? The output of BLAST? I'm willing to run a slow tool once to get a quality result that I can then use as a benchmark.

dna alignment • 2.4k views
ADD COMMENT
2
11.8 years ago

Just an idea for you: simulate the data. That way you will know the false positive and false negative rates.
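
A minimal sketch of the simulation idea, in Python. All names here (`simulate_reads`, `to_fastq`, the `error_rate` parameter) are made up for illustration, and it only models substitution errors with a constant placeholder quality string; real read simulators also model indels and realistic quality profiles. The point is just that every read carries its true origin, so you can check where an aligner puts it.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads from `reference` and mutate bases at `error_rate`.

    Returns a list of (name, sequence, true_start) tuples, where
    true_start is the 0-based offset the read was taken from.
    """
    rng = random.Random(seed)  # fixed seed so the benchmark is repeatable
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        for j, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitution error: pick any base other than the original.
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        # The true position is stored in the read name so it survives alignment.
        reads.append((f"read{i}_pos{start}", "".join(bases), start))
    return reads


def to_fastq(reads, quality_char="I"):
    """Format reads as FASTQ text with a constant placeholder quality."""
    return "".join(
        f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n"
        for name, seq, _ in reads
    )
```

Write the FASTQ text to a file, align it against the same reference, and compare each reported position with the one in the read name. Keep in mind that positions here are 0-based, while SAM output uses 1-based positions.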

There is a lot of well-developed alignment software: Bowtie, Bowtie 2, BWA, etc.

Why are you trying to re-invent the wheel?

ADD COMMENT
0

I am rather sure that my code will be faster.

ADD REPLY
0

Good luck Daniel. Let me know if you want beta testers.

ADD REPLY
0

What I really need is (a) ONE REAL (not artificial) input to align against, say, hg19.fa and (b) a known, really high-quality expected output. If the best quality output can be gotten from a public tool, such as BLAST, then some advice on what parameters to pass to BLAST would help so I can run it myself.

ADD REPLY
1

Complete Genomics has some new genomes with a very, very low error rate. Not sure when they will be available. As for a positive control (where the reads should go), I think the best you could do is BWA or Bowtie 2.

http://www.completegenomics.com/news-events/press-releases/Complete-Genomics-Announces-New-Technology-Developed-to-Set-Standard-for-Clinical-Grade-Genomes-161946595.html

ADD REPLY
0
11.8 years ago
Whetting ★ 1.6k

I would agree with @Zev. No need to invest time in re-creating a tool like that.

ADD COMMENT
1

What if my code is an order of magnitude faster? Your thesis seems to be that innovation is not possible. Where is the forum where people work together on innovation in biotechnology? I seem to have wandered onto the Forum of Useless and Unhelpful Remarks.

ADD REPLY
0
11.8 years ago
matted 7.8k

I agree with the other posters in being skeptical about your claims, particularly given the apparent lack of familiarity with competing tools and evaluation metrics suggested by your questions. However, this is an inference based on limited data, and so it might be wrong. Best of luck nevertheless.

To answer your original question, take a look at the evaluation sections of two recent state-of-the-art tools. They use both simulated and published data (freely downloadable on the SRA, with accession codes given in the papers) to evaluate the performance of their algorithms. You could compare your new method against the competitors shown in Tables 1 and 2 in the BWA paper:

http://bioinformatics.oxfordjournals.org/content/25/14/1754.full

Or similarly with Figures 1 and 2 in the Bowtie 2 paper:

http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html
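
Whichever data set you pick, the comparison itself comes down to counting reads placed correctly, placed incorrectly, and not placed at all. Here is a rough sketch in Python, assuming you have a trusted position for each read (the true origin for simulated reads, or the output of a reference aligner for real ones). The function name, the dictionary layout and the `tolerance` parameter are my own assumptions for illustration, not something taken from either paper, and the sketch assumes a single reference sequence.

```python
def score_alignments(truth, reported, tolerance=0):
    """Compare reported read positions with trusted ones.

    truth:    dict mapping read name -> trusted start position
    reported: dict mapping read name -> reported start position,
              or None (or a missing key) if the read was not aligned
    tolerance: maximum allowed offset in bases, e.g. to absorb
               clipping at the read ends
    Both dicts must use the same coordinate convention.
    """
    correct = wrong = unaligned = 0
    for name, true_pos in truth.items():
        pos = reported.get(name)
        if pos is None:
            unaligned += 1
        elif abs(pos - true_pos) <= tolerance:
            correct += 1
        else:
            wrong += 1
    total = len(truth)
    aligned = correct + wrong
    return {
        "correct": correct,
        "wrong": wrong,
        "unaligned": unaligned,
        # Fraction of all reads that the aligner placed somewhere.
        "aligned_fraction": aligned / total if total else 0.0,
        # Fraction of placed reads that landed in the wrong spot.
        "error_rate": wrong / aligned if aligned else 0.0,
    }
```

Running the same scoring on the output of each aligner gives numbers you can put side by side, together with wall-clock time for the speed comparison.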

ADD COMMENT
2

Thanks for the pointers. On your other point, has it occurred to you that someone might know something about computation/algorithms without knowing much biology?

ADD REPLY
