Biostar Beta. Not for public use.
Blast locally on multiple fasta files with multiple database
0
Entering edit mode
16 months ago
fec2 • 20

Hi all,

I need to run blast locally on multiple fasta files contain in a directory. Previously, I have used command below:

ls *.fasta | parallel -a - blastp -query {} -db my_database -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {.}.tsv

Since I need to do pairwise comparison and need to blast all fasta files against each other, the command above will need to performe multiple time because I have to specify my database. Is it possible to have a command that is able to blast all fasta file against each other? And generate output file with file name that combined the name of my database plus the query file name?

Thank you.

sequence genome • 211 views
ADD COMMENTlink
0
Entering edit mode

Small nitpick but

Since I need to do pairwise comparison and need to blast all fasta files against each other

does not match the title of this post.

Blast locally on multiple fasta files with multiple database

May want to edit one or other.

ADD REPLYlink
0
Entering edit mode
  • the command above will need to performed multiple time because I have to specify my database
  • And generate output file name that combined of the name of my database
  • -db mydatabase
ADD REPLYlink
0
Entering edit mode

Hi,

Thank you. Any suggestion for me to change the title of the post?

ADD REPLYlink
0
Entering edit mode

Thanks all, the command work well!

ADD REPLYlink
2
Entering edit mode
16 months ago
SMK ♦ 1.3k
Ghent, Belgium

Something like this:

#!/bin/bash
set -euo pipefail

for q in *.fasta; do
  for s in *.fasta; do
    # Ignore self-comparison
    if [[ "${q}" != "${s}" ]]; then
      # Before blastp, check if exist or create database for ${s} (I leave this to you)
      blastp -query ${q} -db ${s} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out ${q}_2_${s}.tsv
    fi
  done
done
ADD COMMENTlink
2
Entering edit mode

This is NOT doing pairwise blast comparisons. One would need to use

blastp -query ${q} -subject ${s} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out ${q}_2_${s}.tsv

with fasta format files.

ADD REPLYlink
0
Entering edit mode

Thanks, genomx for introducing the -subject argument. Will blastp -query ${q} -subject ${s} build a db for ${s} each time? Or it's without the use of db (how's tradeoff for big fasta file?)

Because as I commented in the response:

Before blastp, check if exist or create database for ${s} (I leave this to you)

which assumed there is a db with prefix ${s}, or build the db if not exist, only the first time.

ADD REPLYlink
1
Entering edit mode

Other people already asked about possible differences between -subject and -db:

Different results with BLAST depending on if subject is formatted database or FASTA-file

blast - difference between database and subject

The first post indicates it might result in very different output due to how blast parses the subject in each case, though I would like to see confirmation before believing (and I am feeling lazy to test this today, maybe another day).

ADD REPLYlink
0
Entering edit mode

-subject allows direct comparisons of two sequence files against each other. No database needed.

ADD REPLYlink
1
Entering edit mode
17 months ago
h.mon 25k
Brazil

I will borrow from SMK loop above to provide a GNU parallel solution.

First save the all vs all filenames into a file:

for q in *.fasta; do
  for s in *.fasta; do
    # Ignore self-comparison
    if [[ "${q}" != "${s}" ]]; then
      echo "${q} ${s}" >> allpairs.txt
    fi
  done
done

Then cat the filenames into parallel to build the multiple blast commands:

cat pairs.txt | parallel --colsep ' ' -j 1 \
    'blastp -query {1} -subject {2} -evalue 0.00001 -qcov_hsp_perc 50 -outfmt 6 -max_target_seqs 1 -out {1.}_vs_{2.}.tsv'

If you have multiple processors cpu cores, you can use -j n to perform n blast searches simultaneously. You don't need to build blast databases, blast can perform fasta vs fasta comparison with -query file1 -subject file2 , but in case each fasta file is large, then it would be best if you build the databases beforehand, and replace -subject {2} by -db {2.}.

edits / comments:

  • in view of my comment above, you should check if indeed -subject and -db behave the same way or not.
  • with older blast versions, scaling wasn't very good and it was faster to perform n parallel searches than using -num_threads n. This may have improved in recent blast versions (current is BLAST+ 2.9.0), but I didn't test.
  • keep in mind if you run several blast searches simultaneously with GNU parallel, memory usage will increase correspondingly compared to just one blast search.
ADD COMMENTlink
0
Entering edit mode

Tried a query against the human genome. Was pretty quick. So does not look like OP will need to build databases.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3.1