Blastp - through R
2
0
Entering edit mode
8.2 years ago

Hi,

I'm supposed to do reciprocal best hit using blastp for a long list of proteins for which I'm given only the UniProtKB/TrEMBL identifier. I have to find the ortholog proteins for each of the given proteins. I also have to find orthologs in a given list of species.

Previously, when I had only one Protein, I would do it through online tool in NCBI, there I can give the identifier and also limit the species. However, with more proteins in hand, this is not feasible anymore.

I downloaded Blast, but I cannot download any databases (memory problem, and I need refseq_protein database).

I'm familiar with R and would like to do it in R, but sending online queries to the NCBI blastp.

I searched a lot and was not able to find hint.

I installed orthologr, biomaRt and some other packages, but they only used local databases.

I would appreciate it a lot if somebody could help me how I can send online query through R. I have an R code and somewhere in the code I need to call blastp to find orthologs of my proteins among the given list of species.

Thanks and looking forward.

Maah

R blast • 6.3k views
ADD COMMENT
0
Entering edit mode
8.2 years ago
Michael 54k

There is no advantage in using R to start a long running blast process, instead just start Blast+ from the commandline using the -remote option, that does what you describe above. R does not have a good blast parser unlike perl and python, you are bound to the tabular formats (6,7, or 10).

ftp://ftp.ncbi.nlm.nih.gov/blast/documents/blastdb.html has an overview of available blast databases. See the manual pages for more help.

For your case the commandline looks somewhat like:

blastp -remote -db refseq_protein -query my.fasta -outfmt 6 > blasttab.txt

or use the -entrez_query parameter instead of -query, if you want to run the search without making your own fasta file, see

http://www.ncbi.nlm.nih.gov/books/NBK3837/

ADD COMMENT
0
Entering edit mode

Thanks a lot for your response.

Is it also possible to give protein sequence identifier instead of fasta file? In the web tool this is possible, like if you enter:

O48946

Is there anyway to specify the species as well?

ADD REPLY
0
Entering edit mode

I think it could work through the -entrez_query parameter but I haven't tried.

ADD REPLY
0
Entering edit mode

I figured this out, no worries for this.

ADD REPLY
0
Entering edit mode

Would you mind please send me the link for python?

I thought you know a good manual on how to do blast in biopython and you could maybe share it with me, just to save time.

Otherwise I know how to search it :)

ADD REPLY
0
Entering edit mode

google:biopython

ADD REPLY
0
Entering edit mode
8.2 years ago

I found that I can send my query like the following through R:

system2('blastp',c('-remote', "-best_hit_overhang", "0.1", "-best_hit_score_edge", "0.1",'-db', "'refseq_protein'" ,'-entrez_query', "prf||O48946"),stdout=TRUE)

However, I get the following error:

[1] "Error: (301.23) [CONN_Read(blast4/HTTP; http://www.ncbi.nlm.nih.gov/Service/dispd.cgi?service=blast4&address=W7DELL902035.mpimp-golm.mpg.de&platform=i386-pc-win64)]  Unable to read data: Timeout[30.000000]"
[2] "Warning: Error initializing remote BLAST database data loader: Protein BLAST database ''refseq_protein'' does not exist in the NCBI servers"                                                                  
[3] "Command line argument error: Query is Empty!"                                                                                                                                                                 
attr(,"status")
[1] 1
Warning message:
running command '"blastp" -remote -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -db 'refseq_protein' -entrez_query prf||O48946' had status 1

Could anybody help me with this?

ADD COMMENT
0
Entering edit mode

You can do it that way, calling via R, but here you can exactly see the problem: its getting too complicated for a beginner. You have additional quotes here around the db which should be removed, again there is no advantage in using system in R over typing this into your console first.

ADD REPLY
0
Entering edit mode
system2('blastp',c('-remote', "-best_hit_overhang", "0.1", "-best_hit_score_edge", "0.1",'-db', "'refseq_protein'" ,'-entrez_query', "prf||O48946"),stdout=TRUE)
________________________________________________________________________________________________^^______________^^

should be:

system2('blastp',c('-remote', "-best_hit_overhang", "0.1", "-best_hit_score_edge", "0.1",'-db', 'refseq_protein' ,'-entrez_query', "prf||O48946"),stdout=TRUE)
________________________________________________________________________________________________^______________^

Hint: just make it a custom to use only one type of quotes throughout, either single or double, in R there is no semantic difference between the different quote types, unlike e.g. in perl. quoting errors are sometimes hard to spot.

ADD REPLY
0
Entering edit mode

Thanks a lot. You are right with respect to the quotation. I found this code in Biostar and I thought quotations should differ for the sake of blast query.

ADD REPLY
0
Entering edit mode

Where did you find it? Then its possibly also wrong there.

ADD REPLY

Login before adding your answer.

Traffic: 2674 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6