Question

BioPython BLAST Records HSP Sequences

0

Entering edit mode

5.8 years ago

samvanstroud • 0

When using BioPython to BLAST sequences online through the NCBI (https://www.ncbi.nlm.nih.gov/BLAST), we recieve back BLAST record objects.

blast_record = NCBIXML.read(result_handle)

We can access HSPs using (as per the tutorial found at http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc93).

for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...          print('****Alignment****')
...          print('sequence:', alignment.title)
...          print('length:', alignment.length)
...          print('e value:', hsp.expect)
...          print(hsp.query[0:75] + '...')
...          print(hsp.match[0:75] + '...')
...          print(hsp.sbjct[0:75] + '...')

My question is what exactly do the hsp.query, hsp.match, and hsp.sbjct strings represent? I would like to access the full sequence of the matched protein. Is this represented by hsp.sbjct? I am confused by the fact that this is found within alignment.hsps which suggests that we are just looking at the region of the protein that matched.

The question is: "what is the best way to access the full sequence of proteins matched through BLAST using BioPython?".

Thanks very much indeed for any help!

biopython blast sequences ncbi • 5.7k views

ADD COMMENT • link updated 5.8 years ago by Gungor Budak ▴ 270 • written 5.8 years ago by samvanstroud • 0

score 3 · Accepted Answer · 2018-06-16

Hi,

hsp.query shows the sequence that you are searching for in the BLAST database
hsp.match shows the matches between the query and the subject
hsp.subject is the matched part of the sequence in the BLAST database.

For example,

I obtained following portion of the HSP90 protein and BLASTed against Non-redundant protein sequences (nr) downloaded the result XML and parsed using the code given in the BioPython section you linked (it also shows what other fields are available for alignments).

301 pmgrgtkvil hlkedqteyl eerrikeivk khsqfigypi tlfvekerdk evsddeaeek
361 edkeeekeke ekesedkpei edvgsdeeee kkdgdkkkkk kikekyidqe elnktkpiwt
421 rnpdditnee ygefyksltn dwedhlavkh fsvegqlefr allfvprrap fdlfenrkkk

I additionally printed hsp.sbjct_start to see or demonstrate this is a local search and it will only show the part they are aligning. Here is the code and its output:

Code:

from Bio.Blast import NCBIXML
with open('J9TJ99ZH015-Alignment.xml') as result_handle:
    blast_record = NCBIXML.read(result_handle)
E_VALUE_THRESH = 0.04
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH and 'NP_001017963.2' in alignment.title:
            print('****Alignment****')
            print('sequence:', alignment.title)
            print('length:', alignment.length)
            print('e value:', hsp.expect)
            print(hsp.query[0:75] + '...')
            print(hsp.match[0:75] + '...')
            print(hsp.sbjct[0:75] + '...')
            print('Subject start', hsp.sbjct_start)

Output:

Gungors-MacBook-Pro:Desktop gungor$ python blast.py
****Alignment****
sequence: gi|153792590|ref|NP_001017963.2| heat shock protein HSP 90-alpha isoform 1 [Homo sapiens]
length: 854
e value: 4.29656e-116
PMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAEEKEDKEEEKEKEEKESE...
PMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAEEKEDKEEEKEKEEKESE...
PMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDKEVSDDEAEEKEDKEEEKEKEEKESE...
Subject start 301

As you see it shows from the part I obtained from NCBI. It doesn't show the complete sequence.

Up to this point, I hope the concepts are more clear. Now, to get the full sequence of the matched protein, you might get the accession number from the alignment and search any database using that accession number either through an API or you download all, say, human protein sequences and parse file and extract the full sequence from there.

For the API part, you may again use BioPython for Entrez utilities (just strip the ID, and put it to efetch, see below):

from Bio import Entrez
Entrez.email = "your@email.com"
handle = Entrez.efetch(db="protein", id="NP_001017963.2", rettype="fasta", retmode="text")
print(handle.read())

And the output is a FASTA file with the full sequence:

NP_001017963.2 heat shock protein HSP 90-alpha isoform 1 [Homo sapiens] MPPCSGGDGSTPPGPSLRDRDCPAQSAEYPRDRLDPRPGSPSEASSPPFLRSRAPVNWYQEKAQVFLWHL MVSGSTTLLCLWKQPFHVSAFPVTASLAFRQSQGAGQHLYKDLQPFILLRLLMPEETQTQDQPMEEEEVE TFAFQAEIAQLMSLIINTFYSNKEIFLRELISNSSDALDKIRYESLTDPSKLDSGKELHINLIPNKQDRT LTIVDTGIGMTKADLINNLGTIAKSGTKAFMEALQAGADISMIGQFGVGFYSAYLVAEKVTVITKHNDDE QYAWESSAGGSFTVRTDTGEPMGRGTKVILHLKEDQTEYLEERRIKEIVKKHSQFIGYPITLFVEKERDK EVSDDEAEEKEDKEEEKEKEEKESEDKPEIEDVGSDEEEEKKDGDKKKKKKIKEKYIDQEELNKTKPIWT RNPDDITNEEYGEFYKSLTNDWEDHLAVKHFSVEGQLEFRALLFVPRRAPFDLFENRKKKNNIKLYVRRV FIMDNCEELIPEYLNFIRGVVDSEDLPLNISREMLQQSKILKVIRKNLVKKCLELFTELAEDKENYKKFY EQFSKNIKLGIHEDSQNRKKLSELLRYYTSASGDEMVSLKDYCTRMKENQKHIYYITGETKDQVANSAFV ERLRKHGLEVIYMIEPIDEYCVQQLKEFEGKTLVSVTKEGLELPEDEEEKKKQEEKKTKFENLCKIMKDI LEKKVEKVVVSNRLVTSPCCIVTSTYGWTANMERIMKAQALRDNSTMGYMAAKKHLEINPDHSIIETLRQ KAEADKNDKSVKDLVILLYETALLSSGFSLEDPQTHANRIYRMIKLGLGIDEDDPTADDTSAAVTEEMPP LEGDDDTSRMEEVD