Question

How to validate the completeness of an assembly of unaligned reads?

5 weeks ago

Sony ▴ 10

Hello everyone,

I am trying to assemble unaligned reads into de novo contigs (the unaligned reads were extracted after mapping paired-end reads to a reference genome). I tried two de novo assemblers: MaSuRCA and SPAdes.

1. The assembly stats generated by QUAST for MaSuRCA:

# contigs (>= 0 bp)         6180
# contigs (>= 1000 bp)      1284
# contigs (>= 5000 bp)      8
# contigs (>= 10000 bp)     0
# contigs (>= 25000 bp)     0
# contigs (>= 50000 bp)     0
Total length (>= 0 bp)      4715175
Total length (>= 1000 bp)   2119701
Total length (>= 5000 bp)   47692
Total length (>= 10000 bp)  0
Total length (>= 25000 bp)  0
Total length (>= 50000 bp)  0
# contigs                   3546
Largest contig              6948
Total length                3703670
GC (%)                      39.04
N50                         1110
N90                         600
auN                         1449.9
L50                         1030
L90                         2868
# N's per 100 kbp           0.00

2. The assembly stats generated by QUAST for SPAdes:

# contigs (>= 0 bp)         52187
# contigs (>= 1000 bp)      2881
# contigs (>= 5000 bp)      47
# contigs (>= 10000 bp)     1
# contigs (>= 25000 bp)     0
# contigs (>= 50000 bp)     0
Total length (>= 0 bp)      20642697
Total length (>= 1000 bp)   5141662
Total length (>= 5000 bp)   287508
Total length (>= 10000 bp)  12949
Total length (>= 25000 bp)  0
Total length (>= 50000 bp)  0
# contigs                   8583
Largest contig              12949
Total length                9033035
GC (%)                      37.17
N50                         1133
N90                         578
auN                         1612.0
L50                         2293
L90                         6900
# N's per 100 kbp           0.00
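
For reference, the contiguity metrics in the two QUAST tables above (N50, L50, N90, L90, auN) can be recomputed from a plain list of contig lengths. The following is a minimal Python sketch, not QUAST's own code: the function name `assembly_stats` and the `min_len` parameter are illustrative, and the 500 bp default is meant to mirror QUAST's default minimum contig length, which would explain why "# contigs" is smaller than "# contigs (>= 0 bp)".

```python
def assembly_stats(lengths, min_len=500):
    """Compute N50/L50, N90/L90 and auN from contig lengths.

    Contigs shorter than min_len are ignored (assumed to match
    QUAST's default --min-contig of 500 bp).
    """
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    if total == 0:
        raise ValueError("no contigs at or above min_len")

    def nx_lx(percent):
        # Walk contigs from longest to shortest until the running
        # sum reaches `percent` of the total assembly length.
        running = 0
        for count, length in enumerate(kept, start=1):
            running += length
            if running * 100 >= percent * total:
                return length, count

    n50, l50 = nx_lx(50)
    n90, l90 = nx_lx(90)
    # auN: length-weighted mean contig length.
    aun = sum(n * n for n in kept) / total
    return {"contigs": len(kept), "total": total,
            "N50": n50, "L50": l50, "N90": n90, "L90": l90, "auN": aun}


if __name__ == "__main__":
    # Toy lengths for illustration only, not taken from either assembly.
    print(assembly_stats([10, 8, 6, 4, 2], min_len=0))
```

Because both assemblies have an N50 close to 1.1 kbp, these metrics alone do little to separate them; the larger differences are in total length and contig count.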

In this case, I want to choose the better of the two assemblies.

To check for misassemblies, I mapped the raw paired-end reads back to the assembled contigs. My expectation is that every position should be covered by reads.

Based on the mapping stats, very few reads fail to map back to the contigs, and 98% of reads are properly paired (the results are the same for both assemblers).
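
The "every position is covered" check can be made concrete by counting, per contig, the positions whose depth falls below a threshold. This is a minimal sketch, assuming the per-base depth has been exported as tab-separated `contig, position, depth` lines (the layout written by `samtools depth -a` for a single BAM file); the function name `low_coverage_fraction` and the `min_depth` parameter are illustrative.

```python
from collections import defaultdict


def low_coverage_fraction(depth_lines, min_depth=1):
    """Return {contig: fraction of positions with depth < min_depth}.

    depth_lines: iterable of 'contig<TAB>pos<TAB>depth' strings.
    Zero-depth positions must be present in the input (hence the
    assumption that the depth table was produced with -a).
    """
    positions = defaultdict(int)
    low = defaultdict(int)
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        contig, _pos, depth = line.split("\t")[:3]
        positions[contig] += 1
        if int(depth) < min_depth:
            low[contig] += 1
    return {contig: low[contig] / positions[contig] for contig in positions}


if __name__ == "__main__":
    # Toy input for illustration only.
    example = ["c1\t1\t5", "c1\t2\t0", "c2\t1\t3"]
    print(low_coverage_fraction(example))
```

A high mapping rate by itself does not show that every contig position is supported, so a per-contig view like this complements the properly-paired percentage.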

Then I screened the assembled contigs for contaminant sequences and removed them (using NCBI's Foreign Contamination Screen, FCS-GX). The summarised sequence stats are in this figure:

[Figure: sequence stats after FCS-GX contamination screening]

Then I checked for repetitive sequences in my clean.fasta with RepeatModeler (clean.fasta is what remains after removing the contaminant sequences from the primary assembly output).

The summary of the detected repeat sequences: [Figure: RepeatModeler repeat summary]

I am wondering which assembly is more correct/complete. The sequences assembled by MaSuRCA and SPAdes are significantly different. Are there any suggestions for how to determine which assembler is really the better one in this case, or any further analysis I can do to confirm that?

Thank you.

MaSuRCA SPAdes • 306 views

ADD COMMENT • link updated 5 weeks ago by GenoMax 141k • written 5 weeks ago by Sony ▴ 10

The reference genome that you aligned to originally - of what quality is it? Is it a reasonably "finished" genome? If so, then the unaligned reads you ended up with may not actually represent much useful additional information.

Have you tried running BUSCO on the assemblies above to see if there is any real information in there that was missing from the original reference (which you should check with BUSCO as well)?
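
One way to make that comparison concrete is to parse the one-line score that BUSCO writes in its short summary and compare the "missing" percentage between runs. This is a minimal sketch, assuming the common `C:..%[S:..%,D:..%],F:..%,M:..%,n:..` layout (other BUSCO versions may format it differently); the function name `parse_busco_summary` is illustrative and the example strings are made-up placeholders, not results for any real assembly.

```python
import re

# Pattern for the one-line BUSCO score, e.g.
# C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000
_BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Return the C/S/D/F/M percentages and n from a BUSCO score line."""
    match = _BUSCO_RE.search(line)
    if match is None:
        raise ValueError("no BUSCO score found in: %r" % line)
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


if __name__ == "__main__":
    # Placeholder strings for illustration only.
    reference = parse_busco_summary("C:90.0%[S:85.0%,D:5.0%],F:4.0%,M:6.0%,n:1000")
    with_contigs = parse_busco_summary("C:90.5%[S:85.5%,D:5.0%],F:4.0%,M:5.5%,n:1000")
    # Positive value: fewer missing BUSCOs once the new contigs are added.
    print(reference["M"] - with_contigs["M"])
```

If the missing percentage barely changes between the runs, the new contigs probably add little gene content that the reference lacked.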

If you are sure that your "unaligned" reads actually are from the correct genome, perhaps what you should try is to assemble the entire dataset and then compare the resulting assembly to the "reference" genome.

ADD REPLY • link 5 weeks ago by GenoMax 141k

Thank you, sir, for your suggestion.

The reference genome that I used is a chromosome-level genome assembly [link].
I used NCBI FCS-GX to screen for contaminant sequences in my assembled contigs (searching specifically with TaxID 3705, "Brassica" species), so I believe that the contigs assembled from the unmapped reads are from Brassica.
I have not tried running BUSCO yet. If I do run it, which sequence should I use: the primary assembly (the original assembler output), or the original reference genome concatenated with the clean contigs (the assembled contigs after removing the contaminant sequences)?

I am a newbie in this field and would highly appreciate any suggestions. Thank you.

ADD REPLY • link 5 weeks ago by Sony ▴ 10

The NCBI genome link you posted is from 2014 but the assembly appears to be at the chromosome level. Based on the included BUSCO analysis, it seems to be 90% complete for single-copy genes.

Did the reference you originally used contain the 32876 unplaced scaffolds that are present in the NCBI assembly? If you did not include those pieces, then much of what you are assembling from "unmapped" reads may already be in those scaffolds. If you did include those scaffolds, and your "unmapped" reads therefore represent real/new sequence, how are you going to place those contigs in the full assembly? That is the reason I was suggesting that you try a complete assembly of your entire dataset followed by a comparison to the published NCBI assembly.

Why did you think that you had "contamination" in your sequences? If you did not, then you could be throwing good sequences away. If you did have "real" contamination, then you may have a bigger problem on your hands.

ADD REPLY • link 5 weeks ago by GenoMax 141k