Dear Friends,
I am running the MIRA contig assembler on an Ion Torrent-sequenced bacteriophage FASTQ file. The file contains about 1800000 reads with an average read length of 300, and the reference genome is unknown. To run the program efficiently, I have split the FASTQ file into smaller files of different sizes, e.g. "fastq1.fastq" has 10000 reads, and so on. I am now experimenting with these files to find out how many reads give the best resulting contig. Ideally, the best result would be just 1 contig. Could you please tell me how many reads I should use, given the information above, to get the best contig? Thanks much!
Since you are working with a phage (assuming your DNA is pure phage), you are going to have a large amount of data that heavily oversamples the genome. Too much coverage tends to hurt assembly quality rather than help it. You can either follow the method of incrementally adding reads, or use a normalization method to intelligently look at the entire dataset at the same time.
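To make the oversampling point concrete, here is a rough back-of-the-envelope sketch in Python. The read count and read length come from the question; the genome size (50 kb) and the target coverage (50x) are purely illustrative assumptions, since the real genome size is unknown and a suitable coverage depends on the assembler and the data. The function names are made up for this example.

```python
import math

def estimated_coverage(num_reads, mean_read_len, genome_size):
    """Average depth: total sequenced bases divided by genome length."""
    return num_reads * mean_read_len / genome_size

def reads_for_coverage(target_coverage, mean_read_len, genome_size):
    """Number of reads needed to reach a target average depth."""
    return math.ceil(target_coverage * genome_size / mean_read_len)

# From the question: about 1800000 reads, average length 300.
NUM_READS = 1_800_000
MEAN_READ_LEN = 300

# ASSUMPTIONS (not from the thread): a 50 kb genome and a 50x target depth.
ASSUMED_GENOME_SIZE = 50_000
ASSUMED_TARGET_COVERAGE = 50

full_depth = estimated_coverage(NUM_READS, MEAN_READ_LEN, ASSUMED_GENOME_SIZE)
needed = reads_for_coverage(ASSUMED_TARGET_COVERAGE, MEAN_READ_LEN, ASSUMED_GENOME_SIZE)

print(f"Depth with all reads: about {full_depth:.0f}x")
print(f"Reads for {ASSUMED_TARGET_COVERAGE}x: about {needed}")
```

Under those assumptions the full dataset is thousands-fold coverage, and only a few thousand reads would be needed for the target depth. Plug in your own genome size estimate once you have one (for example, from a first rough assembly).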
Thanks genomax! If I do normalization, should I run it on the main fastq file with 180000 reads, or on the fastq files generated from the main file, with 10000, 20000 reads, etc.? I would really appreciate your suggestion on this and the rationale behind the choice. Thanks much!
Do the normalization on the entire dataset.
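For anyone wondering what normalization does with the whole dataset, below is a toy Python sketch of one well-known approach, digital normalization by median k-mer count. This is not necessarily the method or tool meant above, and it is not how any particular program implements it. The function names and the defaults for `k` and `cutoff` are illustrative assumptions. It counts k-mers exactly in memory and ignores reverse complements, so treat it as an illustration of the idea, not something to run on 1.8 million reads.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def normalize(reads, k=20, cutoff=20):
    """Keep a read only if the median count of its k-mers, among the reads
    kept so far, is still below the cutoff (k and cutoff are assumed values)."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k, nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept

# Tiny made-up example: one sequence seen 10 times, another seen once.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = normalize(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

The point of running it on everything at once is that redundant reads from heavily covered regions get discarded while reads from rarely seen regions are all retained, which a fixed-size chunk of the file cannot guarantee.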