Biostar Beta. Not for public use.
How to limit/filter BLAST results using NcbiblastnCommandline
0
Entering edit mode
2.2 years ago
yarmda • 0

I am using a command like

NcbiblastnCommandline(query="fasta.fasta", db=nt, outfmt=5, out="out.xml", show_gis=True)


And that is returning results.

However, I want to limit the results to only returning the gi number/taxID (not the alignment or the bulk of the data) and to limit the hitlist size (as you would in NCBIWWW) to some arbitrary number. How can these be done?

Ultimately, I am trying to find related species to a given target that aren't the target and download their sequences. Since BLAST can't provide the complete genomes, I want to take their identifiers, so I don't need most of the BLAST output.

1
Entering edit mode

I'm assuming you're running blast in Python, because you have some upstream/downstream code that do other processes. However, it would be much easier to run the blast on the CLI, or use subprocess() in Python to call your blast cmd. Output format 6, tab-delimited, is the easiest IMO to parse. You can set max_target_seqs to a value for each query sequence to return that number of db hits.

0
Entering edit mode
19 months ago
France/Nantes/Institut du Thorax - INSE…

However, I want to limit the results to only returning the gi number/taxID

using bioalcidae to extract the hit-it, then call efetch to get the XML for the sequence, and use xpath to extract the taxon id/name.

java -jar dist/bioalcidae.jar -F BLAST -e 'while(iter.hasNext()) {var hit=iter.next(); out.println(hit.getHitId());} ' input.blastn.xml |\
grep '^gi|' | cut -d '|' -f 2 | sort | uniq | while read ID
do
wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=${ID}&retmode=xml&rettype=fasta" |\ xmllint --xpath$'concat(TSeqSet/TSeq/TSeq_taxid/text(),"|",TSeqSet/TSeq/TSeq_orgname/text(),"\n")' -
done | uniq | sort |uniq