Pilon polishing highly repetitive nanopore assembly
15 months ago
treitlis • 30

Hi all,

I would like to ask for suggestions on pilon polishing of a canu-assembled genome.

Long story short, we are trying to assemble a chloroplast genome which is extremely repetitive. We have around 3 Gbp of nanopore data, and we are unable to assemble the genome into a single contig. Before we try to manually circularize the assembly, we wanted to polish it first. I used nanopolish and then pilon, using Illumina reads (2x150 bp).

However, here is the big issue: pilon manages to confirm just 76% of the bases for the biggest contig in the assembly. As far as I have read, bwa reports just the best mapping location in the genome, so if a read maps to multiple locations, only a single location is reported. Since the genome seems to contain a lot of repeats, it looks like bwa places each repetitive read at just one location, and the other copies of the repeat in the genome are not polished.
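One way to check how many reads are affected is to count ambiguous placements in the SAM output. The following is a minimal sketch, assuming bwa mem output, where reads with several equally good placements usually get mapping quality 0 and may carry an XA tag listing alternative hits. The function name `summarize_multimappers` and the toy records are hypothetical, for illustration only.

```python
from collections import defaultdict


def summarize_multimappers(sam_lines):
    """Count, per contig, primary alignments that look ambiguously placed.

    Assumption: a read is treated as ambiguous when MAPQ == 0 or when an
    XA:Z tag (alternative hits) is present.
    """
    stats = defaultdict(lambda: {"total": 0, "ambiguous": 0})
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800) records.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        contig = fields[2]
        mapq = int(fields[4])
        has_xa = any(tag.startswith("XA:Z:") for tag in fields[11:])
        stats[contig]["total"] += 1
        if mapq == 0 or has_xa:
            stats[contig]["ambiguous"] += 1
    return {contig: dict(counts) for contig, counts in stats.items()}


# Toy SAM records (hypothetical data): one unique read, one multi-mapped read,
# one unmapped read and one secondary alignment.
toy_sam = [
    "@SQ\tSN:tig1\tLN:1000",
    "r1\t0\ttig1\t10\t60\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0",
    "r2\t0\ttig1\t200\t0\t5M\t*\t0\t0\tACGTA\tIIIII\tNM:i:0\tXA:Z:tig1,+700,5M,0;",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
    "r4\t256\ttig1\t700\t0\t5M\t*\t0\t0\t*\t*",
]
multimapper_stats = summarize_multimappers(toy_sam)
print(multimapper_stats)
```

A high ambiguous fraction on a contig would be consistent with the repeat explanation, although it does not prove it.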

How could I polish the entire contig and make bwa (or some other software) report all mapping locations? I tried bbmap, set to report all mapping locations, and in this way the coverage increased to 99.8% (based on ...). However, bbmap is not a recommended mapper for pilon, and minid for bbmap is set to 0.76 (the default), so I am worried that this type of mapping could also create a lot of issues for pilon.
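To compare mappers on equal terms, the breadth of coverage can be computed from a per-base depth table. This is a minimal sketch, assuming three tab-separated columns (contig, 1-based position, depth) with every position listed, as `samtools depth -a` produces. It is only a proxy and is not the same statistic as pilon's confirmed bases. The name `coverage_breadth` and the toy table are hypothetical.

```python
from collections import defaultdict


def coverage_breadth(depth_lines, min_depth=1):
    """Return, per contig, the fraction of positions with depth >= min_depth.

    Assumption: each line is "contig<TAB>position<TAB>depth" and every
    position of the contig is present in the input.
    """
    covered = defaultdict(int)
    total = defaultdict(int)
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, _position, depth = line.split("\t")
        total[contig] += 1
        if int(depth) >= min_depth:
            covered[contig] += 1
    return {contig: covered[contig] / total[contig] for contig in total}


# Toy depth table (hypothetical data): four positions on one contig.
toy_depth = [
    "tig1\t1\t10",
    "tig1\t2\t0",
    "tig1\t3\t3",
    "tig1\t4\t7",
]
breadth_any = coverage_breadth(toy_depth, min_depth=1)
breadth_5x = coverage_breadth(toy_depth, min_depth=5)
print(breadth_any, breadth_5x)
```

Running it on the bwa and bbmap alignments with the same `min_depth` would show whether the gain from bbmap sits in the repeat regions or is spread over the whole contig.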

I have noticed this issue in data from my other genomes as well, mostly in eukaryotic data where the rRNA sequences occur at multiple locations. Some of the copies are not polished completely and have mismatches to the rRNA sequences which were manually amplified and sequenced by Sanger sequencing. If there were some polymorphism among the rRNA copies in the genome, I would see it in the Sanger sequencing, but there is none in our data.

Any suggestions on how to deal with this?

Thank you


Thank you for your well-written and detailed question. I have slightly adapted your title to make it more specific about what you are asking.

I am not familiar with chloroplast assembly. Could you elaborate on the size of the contig?


Thank you for your quick reply.

The chloroplast genome is probably around 500 kbp (I am still not sure, because the dataset also contains nuclear and bacterial data). The canu assembly has two main contigs, one of 240 kbp and one of 160 kbp, plus some smaller ones. The fraction of confirmed bases is 93% for the 240 kbp contig and 85% for the 160 kbp contig.

I made a mistake in the first post: the 76% of confirmed bases comes from a 280 kbp contig assembled by miniasm. Miniasm does not use corrected reads, but the assembly was polished with nanopolish beforehand. This contig is actually a fusion of the two canu contigs, with the canu contigs overhanging it (which will probably help me to figure out the assembly). The higher fraction of confirmed bases with canu suggests that miniasm might introduce some misassemblies and small insertions. Still, a decent number of regions are not confirmed.

I actually noticed this issue when I put the selected contigs together to polish them, and some of them had really low coverage, so I decided to polish the two assemblies separately. Judging by the canu result, miniasm seems to introduce small insertions which create a mess in the genome, but I still have plenty of unpolished bases in the canu contigs as well. The canu contig with 93% confirmed bases has regions to which other miniasm contigs map, and those miniasm contigs have 99% coverage. This suggests that some repetitive element breaks the contigs in canu, while in miniasm the repeat lies inside the contig, which could explain the discrepancy in coverage.
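To see whether the unconfirmed regions coincide with the repeats, the low-coverage stretches of a contig can be listed as intervals and compared with the places where the other assembly's contigs map. A minimal sketch, assuming a list of per-base depths for a single contig; the name `low_coverage_intervals` and the threshold of 5 are hypothetical choices:

```python
def low_coverage_intervals(depths, min_depth=5):
    """Return 1-based inclusive (start, end) runs where depth < min_depth.

    Assumption: `depths` holds one integer per base of a single contig,
    in positional order.
    """
    intervals = []
    start = None
    for position, depth in enumerate(depths, start=1):
        if depth < min_depth:
            if start is None:
                start = position  # a new low-coverage run begins here
        elif start is not None:
            intervals.append((start, position - 1))
            start = None
    if start is not None:
        # the contig ends inside a low-coverage run
        intervals.append((start, len(depths)))
    return intervals


# Toy per-base depths (hypothetical data).
toy_depths = [10, 2, 1, 9, 8, 0]
low_runs = low_coverage_intervals(toy_depths, min_depth=5)
print(low_runs)
```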

