Question

Need A Public Canonical DNA Alignment Problem With Known Good Solution

2

11.8 years ago

daniel.wilkerson ▴ 50

I am implementing a DNA alignment tool and I want to test it for correctness and performance against the best known tools. The basic alignment problem seems to be "here is this fastq file that we just got out of the sequencing machine; what parts of it align with this known fasta database?" One canonical fasta database seems to be the recent version of the human genome: hg19.fa. I don't want to use any proprietary data; are there public raw sequencing results that I can use somewhere to test against it? What is a good measure of a quality match? The output of BLAST? I'm willing to run a slow tool once to get a quality result that I can then use as a benchmark.

dna alignment • 2.4k views

ADD COMMENT • link updated 11.8 years ago by matted 7.8k • written 11.8 years ago by daniel.wilkerson ▴ 50

score 2 · Answer 1 · 2012-07-15

2

11.8 years ago

Zev.Kronenberg 12k

Just an idea for you: simulate the data. Because you know where every simulated read came from, you will know the false positive and false negative rates.
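A minimal sketch of what simulating the data could look like, assuming substitution-only errors at a uniform rate and a single reference sequence held in memory. The function names (`simulate_reads`, `to_fastq`), the read-naming scheme, and the constant quality string are all made up for illustration; established read simulators model indels and realistic quality profiles, which this does not.

```python
import random


def simulate_reads(reference, n_reads, read_len, error_rate, seed=0):
    """Sample reads from `reference`.

    Returns a list of (name, sequence, true_start) tuples, where
    true_start is the 0-based offset the read was copied from.
    """
    rng = random.Random(seed)
    reads = []
    for i in range(n_reads):
        start = rng.randrange(len(reference) - read_len + 1)
        bases = list(reference[start:start + read_len])
        # Introduce substitution errors only (no insertions or deletions).
        for j, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[j] = rng.choice([b for b in "ACGT" if b != base])
        # Hypothetical naming scheme: the true position is kept in the name.
        reads.append((f"read{i}_pos{start}", "".join(bases), start))
    return reads


def to_fastq(reads, quality_char="I"):
    """Format reads as FASTQ text with a constant placeholder quality string."""
    records = []
    for name, seq, _true_start in reads:
        records.append(f"@{name}\n{seq}\n+\n{quality_char * len(seq)}\n")
    return "".join(records)


if __name__ == "__main__":
    # Toy random reference; in practice this would be read from a FASTA file.
    rng = random.Random(42)
    toy_reference = "".join(rng.choice("ACGT") for _ in range(1000))
    reads = simulate_reads(toy_reference, n_reads=5, read_len=50, error_rate=0.01)
    print(to_fastq(reads), end="")
```

Keeping the true start position next to each read is the point of the exercise: any aligner's output can later be checked against it.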

There is a lot of well-developed alignment software: Bowtie, Bowtie 2, BWA, etc.

Why are you trying to re-invent the wheel?

ADD COMMENT • link 11.8 years ago by Zev.Kronenberg 12k

0

I am rather sure that my code will be faster.

ADD REPLY • link 11.8 years ago by daniel.wilkerson ▴ 50

0

Good luck Daniel. Let me know if you want beta testers.

ADD REPLY • link 11.8 years ago by Zev.Kronenberg 12k

0

What I really need is (a) ONE REAL (not artificial) input to align against, say, hg19.fa and (b) a known, really high-quality expected output. If the best-quality output can be obtained from a public tool, such as BLAST, then some advice on what parameters to pass to BLAST would help so I can run it myself.

ADD REPLY • link 11.8 years ago by daniel.wilkerson ▴ 50

1

Complete Genomics has some new genomes with a very low error rate. Not sure when they will be available. As for a positive control (where the reads should go), I think the best you could do is BWA or Bowtie 2.

http://www.completegenomics.com/news-events/press-releases/Complete-Genomics-Announces-New-Technology-Developed-to-Set-Standard-for-Clinical-Grade-Genomes-161946595.html

ADD REPLY • link 11.8 years ago by Zev.Kronenberg 12k

score 0 · Answer 2 · 2012-07-16

0

11.8 years ago

Whetting ★ 1.6k

I would agree with @Zev. No need to invest time in re-creating a tool like that.

ADD COMMENT • link 11.8 years ago by Whetting ★ 1.6k

1

What if my code is an order of magnitude faster? Your thesis seems to be that innovation is not possible. Where is the forum where people work together on innovation in biotechnology? I seem to have wandered onto the Forum of Useless and Unhelpful Remarks.

ADD REPLY • link 11.8 years ago by daniel.wilkerson ▴ 50

score 0 · Answer 3 · 2012-07-17

I agree with the other posters in being skeptical about your claims, particularly given the apparent lack of familiarity with competing tools and evaluation metrics suggested by your questions. However, this is an inference based on limited data, and so it might be wrong. Best of luck nevertheless.

To answer your original question, take a look at the evaluation sections of two recent state-of-the-art tools. They use both simulated and published data (freely downloadable on the SRA, with accession codes given in the papers) to evaluate the performance of their algorithms. You could compare your new method against the competitors shown in Tables 1 and 2 in the BWA paper:

http://bioinformatics.oxfordjournals.org/content/25/14/1754.full

Or similarly with Figures 1 and 2 in the Bowtie 2 paper:

http://www.nature.com/nmeth/journal/v9/n4/full/nmeth.1923.html
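For the simulated-data half of such an evaluation, one possible way to score an aligner is sketched below. It assumes you already have two mappings keyed by read name: the true start positions from the simulator and the positions your aligner reported. The function name `score_alignments`, the `tolerance` parameter, and the single-reference simplification (no chromosome names or strands) are illustrative assumptions, not the exact metrics used in either paper.

```python
def score_alignments(truth, reported, tolerance=0):
    """Compare reported alignment positions against known true positions.

    truth:    dict mapping read name -> true start position
    reported: dict mapping read name -> reported start position;
              reads the aligner left unaligned are simply absent
    tolerance: maximum allowed distance (in bases) for a hit to count as correct
    """
    correct = wrong = unaligned = 0
    for name, true_pos in truth.items():
        if name not in reported:
            unaligned += 1
        elif abs(reported[name] - true_pos) <= tolerance:
            correct += 1
        else:
            wrong += 1
    aligned = correct + wrong
    return {
        "correct": correct,
        "wrong": wrong,
        "unaligned": unaligned,
        # Fraction of all reads placed correctly.
        "sensitivity": correct / len(truth) if truth else 0.0,
        # Fraction of aligned reads placed correctly.
        "precision": correct / aligned if aligned else 0.0,
    }


if __name__ == "__main__":
    truth = {"r1": 100, "r2": 250, "r3": 400, "r4": 900}
    reported = {"r1": 100, "r2": 252, "r3": 7000}  # r4 was not aligned
    print(score_alignments(truth, reported, tolerance=5))
```

Running the same scoring over the output of BWA or Bowtie 2 on the same reads would give numbers to compare a new tool against, alongside wall-clock time.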