Question

NCBI BLAST on AWS

0

Entering edit mode

6.9 years ago

philipp ▴ 30

Hi,

I need to BLAST 75k sequences against refseg_genomic. I set up an Amazon Server (c3.8xlarge). The instance is running and I uploaded all files of the refseq_genomic database to the /blast/blastdb_custom folder and my sequences in TXT files in the folder /home.

Then I run the following code on my local machine:

import paramiko
from openpyxl import Workbook
from openpyxl import load_workbook

workbook = load_workbook("Sequences.xlsx")
worksheet = workbook.get_sheet_by_name(name = "Sequences")

num_rows = worksheet.max_row

i = 2
k = 1

# Connect to the AWS and BLAST
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect(hostname = "XXX.amazonaws.com", port = 22, username = "ubuntu", key_filename = "AWS_KEY.pem", compress = False)

while i < num_rows + 1:

    SequenceID = worksheet.cell(row = i, column = k).value
    Command = "blastn -query " + str(SequenceID) + ".txt -db refseq_genomic -evalue 1 -word_size 11 -gapopen 5 -gapextend 2 -penalty -3 -reward 2 -max_target_seqs 100 -num_threads 32 -outfmt 5 -out RefSeq-" + str(SequenceID) + ".xml"

    print(Command)

    # Execute command on AWS
    ssh_stdin, ssh_stdout, ssh_stderr = ssh.exec_command(str(Command))

    ssh_stdout.readlines()

    i = i+1

print("Finished")

Unfortunately, only the first command in the loop is started but then the computation runs for hours and whether an output is calculated (although the output file is created) nor is there an error message. I think with the computational power BLASTing a sequence should not take longer than a few seconds. Probably something went wrong but I don't know what.

I appreciate any hints what I have to change to get the BLAST running.

Thanks! Philipp

NCBI BLAST AWS • 2.3k views

ADD COMMENT • link 6.9 years ago by philipp ▴ 30

0

Entering edit mode

I don't know a lot of python, but is there a chance "while i < num_rows + 1" is doing the < before the +1, and converting while(false+1) into an infinite loop?

Anyway, why not log into the remote instance with a terminal and look at what it's doing instead of using any local python at all?

ADD REPLY • link 6.9 years ago by karl.stamm 4.1k

0

Entering edit mode

I would ssh in and run the blast with a set ID. For example:

SequenceID = worksheet.cell(row = 2, column = k).value

That could tell you if the blast works?

Just curious why you're looping? For each call to blastn the whole database is loaded. It might be faster if cat your inputs and use that the query. You can then split/ parse the XML as needed.

ADD REPLY • link 6.9 years ago by Jake Warner ▴ 830

0

Entering edit mode

I'm very curious what you mean with "cat your inputs and use that the query". Could you briefly elaborate that? I have 75k sequences that I have to blast against the refseq_genomic database. I only know the sequence but have no further information like species etc.

For me (not an IT specialist) the logical approach was to loop through all the sequences. If you have a better solution that would be great since I'm stuck at this point for quite a while now.

Thanks!

ADD REPLY • link 6.9 years ago by philipp ▴ 30

0

Entering edit mode

Hi Philipp, It appears that you have one text file per sequence since you are looping over the files with + str(SequenceID) + ".txt. I am assuming that these are in fasta format:

">Sequence_iD

ATCGCCGA..."

If that's the case, you could concatenate (cat) the files together into one multifasta, then you can pass this file to Blast.

ADD REPLY • link 6.9 years ago by Jake Warner ▴ 830