Biostar Beta. Not for public use.
Question: Simplest output format of blast contamination check
0

Dear all hi,

I have finalised a de novo assembly of an uncharacterised organism. To check for contamination (bacteria etc.), we would like to blast it against the NT database.

When I run blast with default settings, I get 33 GB of output, which I don't need. My aim is a single percentage per group, something like:

Bacteria -> 33% similarity.

I also tried megablast with default settings. That did not help either.

Is this kind of summarised result possible?

Thank you very much for the help,

Best regards,

Tunc.

2.0 years ago morovatunc • 400 • updated 2.0 years ago h.mon 25k
2

Try Kraken; it's faster and can generate the kind of output you are asking for.

2.0 years ago
pld
4.8k
0

In what form/format is this assembly? Contigs? How many of them? Why are you getting 33 GB of data? How are you running your blast?

2.0 years ago
genomax
68k
0

The assembly is ~900 MB in size (in FASTA). The total length is ~0.9 Gbp.

I tried:

megablast -i kefal.contigs.fasta -d /mnt/compgen/inhouse/share/blastdb/nt -o kefal_canu_megablast.out
blastn  -i kefal.contigs.fasta -d /mnt/compgen/inhouse/share/blastdb/nt -o kefal_canu_megablast.out
2.0 years ago
morovatunc
• 400
3

Make your blast more stringent. Look for long alignments, since most of the short ones are likely to be spurious or due to chance.

That said, you are on a slippery slope here. If contamination was suspected in your original data, it should have been taken care of before doing the assembly. Now you cannot be sure whether some of the sequence in your assembly should not be there in the first place.
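
One way to act on the "long alignments" advice without touching word size or scoring is to filter the tabular output after the search. A minimal sketch, assuming the search was run with plain `-outfmt 6` (so column 3 is percent identity and column 4 is alignment length). The function name `filter_long_hits` and the default thresholds (500 bp, 90% identity) are illustrative assumptions, not values recommended anywhere in this thread:

```shell
# filter_long_hits FILE [MIN_LENGTH] [MIN_IDENTITY]
# Keep only long, high-identity hits from BLAST tabular output.
# Assumes plain -outfmt 6: col 3 = percent identity, col 4 = alignment length.
# The default thresholds are illustrative; adjust them for your data.
filter_long_hits() {
    awk -F'\t' -v len="${2:-500}" -v id="${3:-90}" \
        '$4 + 0 >= len + 0 && $3 + 0 >= id + 0' "$1"
}

# Usage: filter_long_hits hits.tsv 1000 95 > hits.filtered.tsv
```

With a custom column layout the field numbers shift, so the `$3` and `$4` references would need adjusting.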

2.0 years ago
genomax
68k
0

Yes, you are totally right about taking precautions beforehand. For now, I am just doing this to find the most obvious contaminants.

I don't want to mess with the default word size and mismatch settings. Is there a way to make this tuning simpler?

2.0 years ago
morovatunc
• 400
1

Check the BlobTools docs for suggested blast parameters. In particular, set a stringent e-value cutoff, use a custom output format, and restrict the number of target sequences and HSPs returned (note that -max_target_seqs has a somewhat hidden side effect, though):

-outfmt '6 qseqid staxids bitscore std' \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -evalue 1e-25
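
Put together with the query and database options, the flags above could be wrapped as follows. This is a sketch rather than the exact command from the BlobTools docs: the wrapper name `run_contam_blast`, the explicit `-task megablast`, the thread count and the output file name are assumptions added for illustration. The `staxids` column is only filled in when the BLAST taxonomy database (taxdb) is available next to nt.

```shell
# run_contam_blast QUERY DB OUT [THREADS]
# Hypothetical wrapper around BLAST+ blastn using the stringent settings above.
run_contam_blast() {
    blastn -task megablast \
        -query "$1" \
        -db "$2" \
        -outfmt '6 qseqid staxids bitscore std' \
        -max_target_seqs 1 \
        -max_hsps 1 \
        -evalue 1e-25 \
        -num_threads "${4:-1}" \
        -out "$3"
}

# Usage (input paths as posted earlier in this thread; output name is made up):
# run_contam_blast kefal.contigs.fasta /mnt/compgen/inhouse/share/blastdb/nt kefal_vs_nt.tsv 8
```

With one target and one HSP per query, the output has at most one line per contig, which keeps the file far smaller than the default pairwise report.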
2.0 years ago
h.mon
25k
2

When I try blast with default settings, I simply get 33Gb

What kind of output? XML? Have you tried tabular format? How can you get an output that is probably bigger than your assembly by at least an order of magnitude?

For simple taxonomic identification of contigs, have a look at Kraken (very memory-intensive), Centrifuge or CLARK. However, for a taxonomic contamination check, BlobTools is really nice and more helpful than blast alone.

2.0 years ago h.mon 25k
0

For the commands I ran, please see my reply above.

I have taken a look at the BlobTools paper. It actually seems to mine the blast output to make its graphs.

blastn -outfmt "6 qseqid staxids bitscore std" -query kefal.contigs.fasta -db /mnt/compgen/inhouse/share/blastdb/nt -out kefal_canu_megablast.out (this is from the BlobTools methods)

I was just hoping to find some kind of output that solves this problem, rather than setting up several tools. Thank you for the answer; I will definitely give your suggestions a try.
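
For a rough "taxon -> percentage" summary straight from that tabular output, the hits can be tallied per taxonomy ID. A minimal sketch, assuming the `6 qseqid staxids bitscore std` layout (taxonomy ID in column 2) and at most one hit line per contig; `summarise_taxids` is a hypothetical helper name. Two caveats: the percentages are relative to contigs that had a hit, not to all contigs, and rolling taxonomy IDs up to groups such as "Bacteria" needs a separate taxonomy lookup that is not shown here.

```shell
# summarise_taxids FILE
# Tally hits per taxonomy ID from BLAST tabular output.
# Assumes column 2 holds staxids and each contig appears at most once.
# Prints: taxid <TAB> number of contigs <TAB> percentage of contigs with a hit.
summarise_taxids() {
    awk -F'\t' '
        { count[$2]++; total++ }
        END {
            for (taxid in count)
                printf "%s\t%d\t%.1f%%\n", taxid, count[taxid], 100 * count[taxid] / total
        }' "$1" | sort -t "$(printf '\t')" -k2,2nr
}

# Usage: summarise_taxids kefal_canu_megablast.out | head
```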

2.0 years ago
morovatunc
• 400
