I am implementing a DNA alignment tool and I want to test it for correctness and performance against the best known tools. The basic alignment problem seems to be: "here is a FASTQ file that we just got out of the sequencing machine; which parts of it align with this known FASTA database?" One canonical FASTA database seems to be a recent version of the human genome, hg19.fa. I don't want to use any proprietary data; are there public raw sequencing results somewhere that I can align against it? What is a good measure of a quality match? The output of BLAST? I'm willing to run a slow tool once to get a high-quality result that I can then use as a benchmark.
I am rather sure that my code will be faster.
Good luck, Daniel. Let me know if you want beta testers.
What I really need is (a) ONE REAL (not artificial) input to align against, say, hg19.fa and (b) a known, really high quality expected output. If the best quality output can be obtained from a public tool, such as BLAST, then some advice on what parameters to pass to BLAST would help so I can run it myself.
Complete Genomics has some new genomes with a very low error rate. I'm not sure when they will be available. As for a positive control (where the reads should go), I think the best you could do is BWA or Bowtie 2.
http://www.completegenomics.com/news-events/press-releases/Complete-Genomics-Announces-New-Technology-Developed-to-Set-Standard-for-Clinical-Grade-Genomes-161946595.html
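If you go the BWA or Bowtie 2 route, one way to turn their output into a positive control is to compare where each read lands in their SAM output against where your tool puts it. Below is a minimal sketch of that comparison, not a definitive benchmark. It assumes both tools write standard SAM, and the function names, the 5-base tolerance, and the file names in the usage note are all made up for illustration. Keep in mind that the reference aligner's placements are only a baseline, not ground truth.

```python
def load_primary_alignments(sam_lines):
    """Map (read name, mate number) -> (reference name, 1-based position, is_reverse)."""
    placements = {}
    for line in sam_lines:
        # Skip blank lines and SAM header lines (which start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # a SAM record has 11 mandatory columns
            continue
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        # Ignore unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        # Distinguish the two mates of a pair; 0 means single-end.
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        placements[(qname, mate)] = (rname, pos, bool(flag & 0x10))
    return placements


def concordance(reference, candidate, tolerance=5):
    """Fraction of reference placements that the candidate reproduces.

    A placement counts as reproduced when the reference name and strand match
    and the start positions differ by at most `tolerance` bases (an arbitrary
    allowance for differences in clipping at read ends).
    """
    if not reference:
        return 0.0
    agree = 0
    for key, (rname, pos, is_reverse) in reference.items():
        hit = candidate.get(key)
        if (hit is not None and hit[0] == rname and hit[2] == is_reverse
                and abs(hit[1] - pos) <= tolerance):
            agree += 1
    return agree / len(reference)
```

Usage would look something like `concordance(load_primary_alignments(open("bwa.sam")), load_primary_alignments(open("mytool.sam")))`, where `bwa.sam` and `mytool.sam` are placeholder names for the two outputs on the same FASTQ input. Reads where the two tools disagree are the ones worth inspecting by hand, since either tool could be the one that is wrong.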