Question

Simplest output format of blast contamination check

1

6.2 years ago

morovatunc ▴ 550

Hi all,

I have finalised a de novo assembly of an uncharacterised organism. To check for contamination (bacterial, etc.), we would like to BLAST it against the NT database.

When I run BLAST with default settings, I get 33 GB of output, which I do not need. My aim is to get a single percentage per group, such as:

Bacteria -> 33% similarity.

I also tried a megablast run with default settings, but that did not help either.

Is this kind of summarised result possible?

Thank you very much for the help,

Best regards,

Tunc.

blast megablast • 2.6k views

ADD COMMENT • link updated 6.2 years ago by h.mon 35k • written 6.2 years ago by morovatunc ▴ 550

2

Try Kraken; it's faster and can generate the kind of output you are asking for.

ADD REPLY • link 6.2 years ago by pld 5.1k

0

What form/format is this assembly in? Contigs? How many of them? Why are you getting 33 GB of data? How are you running your BLAST?

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

The assembly is ~900 MB in size (in FASTA format). Its total length is 0.9 Gbp.

I tried:

megablast -i kefal.contigs.fasta -d /mnt/compgen/inhouse/share/blastdb/nt -o kefal_canu_megablast.out
blastn  -i kefal.contigs.fasta -d /mnt/compgen/inhouse/share/blastdb/nt -o kefal_canu_megablast.out

ADD REPLY • link 6.2 years ago by morovatunc ▴ 550

3

Make your BLAST more stringent. Look for long alignments, since most of the short ones are likely to be spurious (chance matches).

That said, you are on a slippery slope here. If contamination was suspected in your original data, it should have been taken care of before doing the assembly. Now you cannot be sure whether some of the sequence in your assembly should be there in the first place.
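
One way to apply the "long alignments only" advice after the fact is to filter tabular BLAST output by alignment length and percent identity. The following is a minimal sketch, not a recommended protocol: the function name `filter_long_hits` is made up for illustration, it assumes the default `-outfmt 6` column order (pident in column 3, alignment length in column 4), and any cutoffs you pass in are your own choice.

```shell
# Keep only hits whose alignment is at least min_len bp long and at
# least min_pident percent identical.
# ASSUMPTION: input uses the default "-outfmt 6" columns, where
# pident is column 3 and alignment length is column 4.
# $1: tabular BLAST output; $2: min_len; $3: min_pident
filter_long_hits() {
    awk -F'\t' -v min_len="$2" -v min_pident="$3" \
        '$4 >= min_len && $3 >= min_pident' "$1"
}
```

For example, `filter_long_hits hits.tsv 500 90` would keep hits of at least 500 bp at 90% identity or better (illustrative values only). If the output was produced with a custom column list, the column numbers in the awk condition need to be adjusted to match.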

ADD REPLY • link 6.2 years ago by GenoMax 141k

0

Yes, you are totally right about taking precautions beforehand. For now, I am only doing this check to find the most obvious contaminants.

I don't want to mess with the default word size and mismatch levels. Is there a simpler way to do this tuning?

ADD REPLY • link 6.2 years ago by morovatunc ▴ 550

1

Check the BlobTools docs for a suggested set of BLAST parameters. In particular, set a stringent e-value cutoff, use a custom output format, and restrict the number of target sequences and HSPs returned (there is a somewhat hidden side effect of -max_target_seqs, though):

-outfmt '6 qseqid staxids bitscore std' \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -evalue 1e-25
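
To turn tabular output in this format into the kind of percentage summary asked for in the question, the hits can be tallied per taxon with awk. This is only a sketch under stated assumptions: the function name `summarise_hits` is hypothetical, the input is assumed to be tab-separated with the query ID in column 1, and the group label is taken from whichever column you name (column 2 is staxids for the -outfmt string above).

```shell
# Summarise BLAST tabular hits as the percentage of queries per group.
# $1: tabular BLAST output (tab-separated, query ID in column 1)
# $2: 1-based column holding the group label
#     (ASSUMPTION: column 2 = staxids, as in the -outfmt string above)
# Prints: group, number of queries, percent of queries that had a hit.
summarise_hits() {
    awk -F'\t' -v col="$2" '
        # Count each query once, using the first hit listed for it.
        !seen[$1]++ { count[$col]++; total++ }
        END {
            for (g in count)
                printf "%s\t%d\t%.1f\n", g, count[g], 100 * count[g] / total
        }
    ' "$1" | sort -t "$(printf '\t')" -k2,2nr
}
```

Usage would be something like `summarise_hits kefal_canu_megablast.out 2`. Note that the percentages are relative to contigs that had at least one hit; contigs with no hit do not appear in tabular output at all, so they are not part of the denominator. The labels here are taxon IDs, so getting to "Bacteria" still requires mapping them to a higher rank with a taxonomy resource of your choice.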

ADD REPLY • link 6.2 years ago by h.mon 35k

score 2 · Answer 1 · 2018-02-15

2

6.2 years ago

h.mon 35k

When I try blast with default settings, I simply get 33Gb

What kind of output? XML? Have you tried tabular format? How do you end up with output that is probably at least an order of magnitude bigger than your assembly?

For simple taxonomic identification of contigs, have a look at Kraken (very memory-intensive), Centrifuge or CLARK. For a taxonomic contamination check, however, BlobTools is really nice and more helpful than BLAST alone.

ADD COMMENT • link 6.2 years ago by h.mon 35k

0

For the runs, please see my reply above.

I have taken a look at the BlobTools paper. It seems to mine the BLAST output to make its graphs.

blastn -outfmt "6 qseqid staxids bitscore std" -query kefal.contigs.fasta -db /mnt/compgen/inhouse/share/blastdb/nt -out kefal_canu_megablast.out (this is from the BlobTools methods)

I was just hoping to find some kind of output that solves this problem without having to set up several tools. Thank you for the answer; I will definitely give your suggestions a try.

ADD REPLY • link 6.2 years ago by morovatunc ▴ 550