Question

Reference based assembly



5.5 years ago

deepti1rao ▴ 50

I'm trying to do a reference-aided assembly of the genome of a new rice variety. I mapped my Illumina reads to the reference and replaced the uncovered bases with Ns. I then used this masked genome FASTA file as the reference and mapped my reads to it again. My plan is to call variants from this second mapping and substitute them into the masked FASTA file to generate my assembly. 96-97% of my reads map to the reference. Is this a good strategy? I have some doubts, because I think the Ns in the masked genome may cause mapping errors.
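To make the masking step concrete, here is a minimal sketch of replacing uncovered bases with Ns. It assumes the per-base read depth is already available as a list of integers, one per reference base; the function name `mask_uncovered` and the `min_depth` parameter are hypothetical, and this is not necessarily how the masking was done in practice.

```python
def mask_uncovered(seq, depths, min_depth=1):
    """Return seq with every base whose depth is below min_depth replaced by N.

    seq and depths are assumed to describe the same contig, base for base.
    """
    if len(seq) != len(depths):
        raise ValueError("sequence and depth vector must have the same length")
    return "".join(
        base if depth >= min_depth else "N"
        for base, depth in zip(seq, depths)
    )


# Toy example: three positions have zero coverage and become N.
masked = mask_uncovered("ACGTACGT", [3, 0, 5, 2, 0, 0, 7, 1])
print(masked)  # ANGTNNGT
```

Because each base is swapped for exactly one N, the masked sequence keeps the same length and coordinates as the original reference.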

Alternatively, should I extract variants from the BAM files that I got by mapping reads to the original (unmasked) reference genome, and substitute them into the masked genome only where the masked genome does not have an N at that position? If so, how should I go about this? I have made a BED file of the uncovered loci.
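The alternative described above could be sketched roughly as follows. This is only an illustration under stated assumptions: variants are given as simple `(pos, ref, alt)` tuples with 1-based positions as in VCF, only single-base substitutions are handled (indels would shift coordinates and need different handling), and the function name `apply_snps` is hypothetical. Since masking replaces each base with one N, positions called against the original reference line up with the masked sequence.

```python
def apply_snps(masked_seq, snps):
    """Substitute SNP alleles into masked_seq, leaving N positions untouched.

    snps: iterable of (pos, ref, alt) with 1-based pos, as in VCF.
    Returns (new_seq, skipped), where skipped lists (pos, reason) pairs.
    """
    bases = list(masked_seq)
    skipped = []
    for pos, ref, alt in snps:
        i = pos - 1  # convert to 0-based index
        if len(ref) != 1 or len(alt) != 1:
            skipped.append((pos, "not a SNP"))
        elif not 0 <= i < len(bases):
            skipped.append((pos, "out of range"))
        elif bases[i].upper() == "N":
            skipped.append((pos, "masked"))
        elif bases[i].upper() != ref.upper():
            skipped.append((pos, "REF mismatch"))
        else:
            bases[i] = alt
    return "".join(bases), skipped


# Toy example: the variant at position 2 falls on an N and is skipped.
new_seq, skipped = apply_snps(
    "ANGTNNGT",
    [(1, "A", "G"), (2, "C", "T"), (4, "T", "C"), (7, "G", "A")],
)
print(new_seq)  # GNGCNNAT
print(skipped)  # [(2, 'masked')]
```

Checking the REF allele against the sequence before substituting is a cheap safeguard against applying variants to the wrong contig or coordinate system.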

Reference Assembly • 4.8k views

updated 5.5 years ago by jean.elbers ★ 1.7k • written 5.5 years ago by deepti1rao ▴ 50

score 0 · Answer 1 · 2018-10-10

I have 96-97% of reads mapping to my reference.

It seems the reference assembly is already good enough and quite similar to your newly sequenced strain, so I wonder why you need to perform a reference-based assembly at all. Regardless of the strategy you take, the reference will be of higher quality than your assembly, and by performing reference-based assembly you may introduce artifacts into your reference.

score 0 · Answer 2 · 2018-11-06

There are other approaches that I would suggest instead of mapping reads to a reference genome. You can start with a de novo assembly and then scaffold it with the help of a reference genome. For example, Ragout uses reference genomes for scaffolding and can take multiple references, which can account for structural variation among the different genomes.

score 0 · Answer 3 · 2018-11-06

You might consider reference-guided de novo assembly (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1911-6). Before trying it, though, you should read the paper and look at the improvement for each assembler in de novo versus reference-guided de novo mode. There is a convenient script for each assembler on BitBucket (https://bitbucket.org/HeidiLischer/refguideddenovoassembly_pipelines), but the scripts do not support starting and stopping at specific steps and do not accept gzipped FASTQs (so if you have limited hard drive space, you would need to modify them). The scripts also do not delete temporary files (again a problem if you have limited storage). Finally, if you only have short-insert libraries and no mate-pair libraries, then I don't think this approach will be a substantial improvement over de novo assembly.