Thousands of contigs in E.coli assembly
5.6 years ago

Hi!

I recently started studying bioinformatics and need help. I assembled and annotated eleven bacterial genomes. After annotation, I ended up with thousands of contigs. When I ran BLASTn, each contig produced many different alignments. How can I identify the correct strain of my bacteria? Should I just select a single contig, or is there a tool to merge them into one sequence?

Tags: assembly, contigs, genome, prokka

Assuming your assembly is valid, the top hit for each contig should give you a good idea of which genome the sequence belongs to (at genus level for sure, perhaps deeper). If you have thousands of contigs for 11 genomes, you probably don't have good assemblies. I suggest checking them with QUAST.
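
As a rough illustration of the kind of numbers an assembly check reports, here is a minimal Python sketch that computes contig count, total length, longest contig and N50 from FASTA lines. It is not QUAST and does not reproduce its full report; the function names are made up for this example.

```python
def read_contig_lengths(fasta_lines):
    """Return the length of each sequence found in FASTA-formatted lines."""
    lengths = []
    current = 0
    seen_header = False
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seen_header:
                lengths.append(current)
            seen_header = True
            current = 0
        elif seen_header:
            current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly


def summarize(lengths):
    """Collect a few basic assembly statistics in one dictionary."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }
```

Passing an open file handle for one assembly to `read_contig_lengths` and the result to `summarize` gives a quick first look. A bacterial assembly with thousands of contigs and an N50 of only a few kilobases would fit the concern raised above.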


Thank you!

I trimmed my sequences using the Trim Galore tool. Each of the eleven genomes still had thousands of contigs after trimming. Does that mean something is wrong?


For example, the first contig of one of my genomes produced several alignments. The top hit was this:

CP003295.1 Streptococcus infantarius subsp. infantarius CJ18, complete genome: max score 48348, total score 72062, query cover 99%, E-value 0.0, identity 96%

Can I infer that this is my strain?


> can I infer that this is my strain?

If the majority of the contigs consistently show hits to Streptococcus infantarius for that one sample, then that can be a reasonable conclusion. You would want to use a tool like Mauve to see how your contigs align to the reference (if one is available) and how many holes/gaps you still have in your sequence.
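
To check whether most contigs point to the same subject, one option is to tally the top hit per contig. The sketch below assumes the BLAST results were saved in the default tabular layout (`-outfmt 6`), where column 1 is the query ID, column 2 the subject ID and column 12 the bit score. The function names are hypothetical, and the tally is by subject accession, not by species name.

```python
from collections import Counter


def top_hits(blast_lines):
    """Keep the highest-bit-score subject for each query (contig)."""
    best = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])  # 12th column in default outfmt 6
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def tally_subjects(hits):
    """Count how many contigs have each subject as their top hit."""
    return Counter(hits.values()).most_common()
```

If one accession (or a group of accessions from the same species) dominates the tally for a sample, that supports the species call. A mixed tally is a hint to look for contamination or a poor assembly.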

> I trimmed my sequences using the Trim Galore tool. Each one of the eleven genomes presented thousands of contigs even after trimming. Is there any problem going on then?

Trimming is only the first step towards assembly. If you are not getting reasonable assemblies, there are several possibilities. Your libraries may be non-comprehensive or under-represented. You may also have too much sequence coverage (it may sound odd, but very deep coverage also leads to problematic assemblies); in that case you will have to down-sample your data before assembling. Can you tell us whether either is the case here?

Were these strains sequenced/assembled independently?
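
To judge whether coverage is too deep, a back-of-the-envelope estimate is total sequenced bases divided by expected genome size. The sketch below does that arithmetic and returns the fraction of reads to keep for a chosen target depth. The function names are hypothetical, and the target depth is something you pick yourself; no particular value is recommended in this thread.

```python
def estimated_coverage(total_read_bases, genome_size_bp):
    """Average depth = total sequenced bases / expected genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_read_bases / genome_size_bp


def downsample_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep so that depth drops to roughly the target."""
    if current_coverage <= target_coverage:
        return 1.0  # already at or below the target, keep everything
    return target_coverage / current_coverage


# Illustrative numbers only (assumptions, not taken from the thread):
# 2 million read pairs of 2 x 250 bp and a genome of about 5 Mb.
total_bases = 2_000_000 * 2 * 250
depth = estimated_coverage(total_bases, 5_000_000)
keep = downsample_fraction(depth, target_coverage=100)
```

With these made-up numbers the depth works out to 200x, so keeping about half of the read pairs would bring it to roughly 100x.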


Are the eleven genomes from eleven isolated cultures? You may also have contamination. Check with BlobTools; it is useful both for helping identify the species and for detecting possible contaminants.
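
BlobTools combines GC content, coverage and taxonomic hits per contig. As a much cruder first pass, and not a substitute for it, the sketch below flags contigs whose GC fraction is far from the length-weighted average of the assembly. The function names and the 0.10 threshold are assumptions chosen for illustration only.

```python
def gc_fraction(seq):
    """GC fraction over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def flag_gc_outliers(contigs, max_diff=0.10):
    """Return names of contigs whose GC differs from the assembly mean.

    contigs: dict mapping contig name -> sequence string.
    max_diff: arbitrary illustrative threshold on the absolute GC difference.
    """
    total_len = sum(len(seq) for seq in contigs.values())
    if total_len == 0:
        return []
    # Length-weighted mean so that short contigs do not dominate.
    mean_gc = sum(gc_fraction(s) * len(s) for s in contigs.values()) / total_len
    return sorted(
        name for name, seq in contigs.items()
        if abs(gc_fraction(seq) - mean_gc) > max_diff
    )
```

Contigs flagged this way are only candidates; checking them by BLAST or with BlobTools itself would be the next step.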


If you've got that many contigs, it suggests your sequencing quality and/or assembly wasn't good to start with. It sounds a lot like contamination. Proceed with caution if you plan to do more with this data. Even just annotating is probably more than poor data justifies.
