5.6 years ago
vitorgomesbio
Hi!
I recently started studying bioinformatics and need help. I assembled and annotated eleven bacterial genomes. After the annotation, I found that each genome consisted of thousands of contigs. When I ran BLASTn, I realized that each contig produced many different alignments. How can I identify the correct strain of my bacteria? Should I just select one single contig, or is there a tool to merge them into just one sequence?
Assuming your assembly is valid, the top hit for each contig should give you a good idea of which genome (at genus level for sure, perhaps deeper) the sequence belongs to. If you have thousands of contigs for 11 genomes, then you probably don't have good assemblies. I suggest that you check them with QUAST.
Thank you!
I trimmed my sequences using the Trim Galore tool. Each one of the eleven genomes presented thousands of contigs even after trimming. Is there a problem, then?
For example, the first contig of one of my genomes resulted in several alignments. The first one was this:
CP003295.1 Streptococcus infantarius subsp. infantarius CJ18, complete genome: max score 48348, total score 72062, query cover 99%, E-value 0.0, identity 96%
Can I infer that this is my strain?
If the majority of the contigs consistently show hits to Streptococcus infantarius for that one sample, then that can be a reasonable conclusion. You would want to use a tool like Mauve to see how your contigs align to the reference (if one is available) and how many holes/gaps you still have in your sequence.
Trimming sequences is only the first step towards assembly. If you are not getting reasonable assemblies, there are multiple possibilities. You may have non-comprehensive/under-represented libraries. You may also have too much sequence coverage (it may sound odd, but really deep coverage also leads to problematic assemblies). In that case you will have to down-sample your data before assembling. Can you tell us if one or the other is the case here?
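The "majority of the contigs" check mentioned above can be done by hand, but with thousands of contigs a small script helps. The following is only a rough sketch, assuming BLASTn was run with tabular output in the column order `qseqid sseqid pident length evalue bitscore stitle` (that column order, the function names, and the example rows with their placeholder accessions are assumptions made for illustration, not data from this thread). It keeps the highest-bitscore hit per contig and counts how often each species name appears, taking the first two words of the hit title as a naive species label.

```python
from collections import Counter


def top_hits(rows):
    """Keep the highest-bitscore hit for each contig.

    `rows` is an iterable of tab-separated lines in the assumed column order:
    qseqid, sseqid, pident, length, evalue, bitscore, stitle.
    """
    best = {}
    for row in rows:
        row = row.rstrip("\n")
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, sseqid, pident, length, evalue, bitscore, stitle = row.split("\t")
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, stitle)
    return best


def species_tally(best):
    """Count top hits per species, using the first two title words as the label."""
    counts = Counter()
    for _score, title in best.values():
        counts[" ".join(title.split()[:2])] += 1
    return counts


# Made-up rows for illustration only (accessions are placeholders).
example_rows = [
    "contig_1\tACC_A\t96.0\t50000\t0.0\t48348\tStreptococcus infantarius subsp. infantarius CJ18, complete genome",
    "contig_1\tACC_B\t90.0\t3000\t0.0\t4000\tStreptococcus equinus strain X",
    "contig_2\tACC_A\t97.0\t20000\t0.0\t30000\tStreptococcus infantarius subsp. infantarius CJ18, complete genome",
    "contig_3\tACC_C\t99.0\t1500\t0.0\t2500\tEscherichia coli strain Y",
]

if __name__ == "__main__":
    for species, n in species_tally(top_hits(example_rows)).most_common():
        print(species, n)
```

If one species dominates the tally for a sample, that supports the identification at species level; a mix of unrelated genera among the top hits would instead point towards contamination or a poor assembly.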
Were these strains sequenced/assembled independently?
Are the eleven genomes from eleven isolated cultures? You may also have contamination. Check BlobTools; it is a useful tool both for helping identify the species and for detecting possible contaminants.
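As a quick first look before running BlobTools, per-contig GC content (one of the axes BlobTools plots) can already hint at contamination: contigs from a single bacterial genome tend to cluster around one GC value. The sketch below is not a replacement for BlobTools, and the function names and the tiny example sequences are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


# Tiny made-up contigs, for illustration only.
example_fasta = [
    ">contig_1 example",
    "ATGCATGC",
    "ATAT",
    ">contig_2",
    "GGCCGGCC",
]

if __name__ == "__main__":
    for name, seq in read_fasta(example_fasta):
        print(name, len(seq), round(gc_fraction(seq), 3))
```

With a real assembly you would pass an open file handle to `read_fasta` and look for groups of contigs whose GC content differs clearly from the rest.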
If you've got that many contigs, it suggests your sequencing quality and/or assembly wasn't good to start with. It sounds a lot like contamination. Proceed with caution, if you plan to do more with this data. Even just annotating is probably more than poor data justifies.
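To put a number on "that many contigs", the contig count and N50 are the usual starting point; QUAST reports both along with many other statistics. Below is a minimal sketch of the N50 calculation, assuming you already have a list of contig lengths (the function name and the example lengths are made up for illustration).

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    counted from the longest contig down, reaches half the assembly size."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 80, 50, 30, 20, 10, 10]

if __name__ == "__main__":
    print("contigs:", len(example_lengths))
    print("total length:", sum(example_lengths))
    print("N50:", n50(example_lengths))
```

A bacterial assembly with thousands of contigs and an N50 of only a few kilobases would be consistent with the concerns raised in this thread.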