Parsing BLAST pairwise output in R
15 months ago
rubic • 180
United States

Hi,

Does anyone know of an R package that parses the pairwise-format output of NCBI's blastp, which looks like this:

BLASTP 2.4.0+


Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402.


Reference for composition-based statistics: Alejandro A. Schaffer, L. Aravind, Thomas L. Madden, Sergei Shavirin, John L. Spouge, Yuri I. Wolf, Eugene V. Koonin, and Stephen F. Altschul (2001), "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements", Nucleic Acids Res. 29:2994-3005.



Database: mm10 ensembl AA
           61,440 sequences; 26,382,989 total letters



Query= NP_001254519.1 mu-type opioid receptor [Heterocephalus glaber]

Length=400
                                                                      Score     E
Sequences producing significant alignments:                          (Bits)  Value

  ENSMUSP00000101232 ENSMUST00000105607;ENSMUSG00000000766;Oprm1      753     0.0
  ENSMUSP00000090410 ENSMUST00000092734;ENSMUSG00000000766;Oprm1      753     0.0
  ENSMUSP00000060590 ENSMUST00000056385;ENSMUSG00000000766;Oprm1      753     0.0
  ENSMUSP00000077704 ENSMUST00000078634;ENSMUSG00000000766;Oprm1      731     0.0

. . .

> ENSMUSP00000101232 ENSMUST00000105607;ENSMUSG00000000766;Oprm1 Length=398

 Score = 753 bits (1943),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 374/400 (94%), Positives = 383/400 (96%), Gaps = 2/400 (1%)

Query  1    MDSSVLPGNAGNCTDPFAQSSCSLAPSPGSWTNLSHLDGNLSDPCGPNRTELGGSDSRCP  60
            MDSS  PGN  +C+DP A +SCS  P+PGSW NLSH+DGN SDPCGPNRT LGGS S CP
Sbjct  1    MDSSAGPGNISDCSDPLAPASCS--PAPGSWLNLSHVDGNQSDPCGPNRTGLGGSHSLCP  58

Query  61   PTGSPSMITAVTIMALYSIVCVVGLFGNFLVMYVIIRYTKMKTATNIYIFNLALADALAT  120
             TGSPSM+TA+TIMALYSIVCVVGLFGNFLVMYVI+RYTKMKTATNIYIFNLALADALAT
Sbjct  59   QTGSPSMVTAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALAT  118

Query  121  STLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  180
            STLPFQSVNYLMGTWPFG ILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF
Sbjct  119  STLPFQSVNYLMGTWPFGNILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  178

Query  181  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRHGSIDCTLTFSHPTWYWENLLKICVFI  240
            RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYR GSIDCTLTFSHPTWYWENLLKICVFI
Sbjct  179  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFI  238

Query  241  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  300
            FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI
Sbjct  239  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  298

Query  301  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  360
            YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI
Sbjct  299  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  358

Query  361  EQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP  400
            EQQNS RIRQNTR+HPSTANTVDRTNHQLENLEAETAPLP
Sbjct  359  EQQNSARIRQNTREHPSTANTVDRTNHQLENLEAETAPLP  398


> ENSMUSP00000090410 ENSMUST00000092734;ENSMUSG00000000766;Oprm1 Length=398

 Score = 753 bits (1943),  Expect = 0.0, Method: Compositional matrix adjust.
 Identities = 374/400 (94%), Positives = 383/400 (96%), Gaps = 2/400 (1%)

Query  1    MDSSVLPGNAGNCTDPFAQSSCSLAPSPGSWTNLSHLDGNLSDPCGPNRTELGGSDSRCP  60
            MDSS  PGN  +C+DP A +SCS  P+PGSW NLSH+DGN SDPCGPNRT LGGS S CP
Sbjct  1    MDSSAGPGNISDCSDPLAPASCS--PAPGSWLNLSHVDGNQSDPCGPNRTGLGGSHSLCP  58

Query  61   PTGSPSMITAVTIMALYSIVCVVGLFGNFLVMYVIIRYTKMKTATNIYIFNLALADALAT  120
             TGSPSM+TA+TIMALYSIVCVVGLFGNFLVMYVI+RYTKMKTATNIYIFNLALADALAT
Sbjct  59   QTGSPSMVTAITIMALYSIVCVVGLFGNFLVMYVIVRYTKMKTATNIYIFNLALADALAT  118

Query  121  STLPFQSVNYLMGTWPFGTILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  180
            STLPFQSVNYLMGTWPFG ILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF
Sbjct  119  STLPFQSVNYLMGTWPFGNILCKIVISIDYYNMFTSIFTLCTMSVDRYIAVCHPVKALDF  178

Query  181  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRHGSIDCTLTFSHPTWYWENLLKICVFI  240
            RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYR GSIDCTLTFSHPTWYWENLLKICVFI
Sbjct  179  RTPRNAKIVNVCNWILSSAIGLPVMFMATTKYRQGSIDCTLTFSHPTWYWENLLKICVFI  238

Query  241  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  300
            FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI
Sbjct  239  FAFIMPVLIITVCYGLMILRLKSVRMLSGSKEKDRNLRRITRMVLVVVAVFIVCWTPIHI  298

Query  301  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  360
            YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI
Sbjct  299  YVIIKALITIPETTFQTVSWHFCIALGYTNSCLNPVLYAFLDENFKRCFREFCIPTSSTI  358

Query  361  EQQNSTRIRQNTRDHPSTANTVDRTNHQLENLEAETAPLP  400
            EQQNS RIRQNTR+HPSTANTVDRTNHQLENLEAETAPLP
Sbjct  359  EQQNSARIRQNTREHPSTANTVDRTNHQLENLEAETAPLP  398

What I actually need is to find all 100% matches between my query and database sequences (their sequence IDs and the corresponding coordinates).

R blast parse • 1.2k views

If you know R, you're better off re-running your BLAST search with tab-delimited output (-outfmt 6) and then filtering the results in R on whichever columns you need.
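
As a rough sketch of that approach, something like the following might work. It assumes the default 12-column layout that BLAST+ writes for -outfmt 6, and it reads "100% match" as an alignment with no mismatches and no gap openings. The three rows are made-up stand-ins for a real results file, and the object names (blast_cols, example_hits, hits, perfect) are just illustrative.

```r
# Default column order of BLAST+ tabular output (-outfmt 6)
blast_cols <- c("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                "qstart", "qend", "sstart", "send", "evalue", "bitscore")

# Hypothetical rows standing in for a real results file; with real data,
# give read.delim() the path to the file instead of using `text =`.
example_hits <- paste(
  "queryA\tsubj1\t100.000\t120\t0\t0\t1\t120\t5\t124\t1e-80\t250",
  "queryA\tsubj2\t94.000\t400\t24\t1\t1\t400\t1\t398\t0.0\t753",
  "queryB\tsubj3\t100.000\t60\t0\t0\t10\t69\t200\t259\t2e-35\t120",
  sep = "\n"
)

hits <- read.delim(text = example_hits, header = FALSE,
                   col.names = blast_cols, stringsAsFactors = FALSE)

# Keep alignments with no mismatches and no gap openings, i.e. 100% identity
# over the aligned region (which may be shorter than the full query).
perfect <- hits[hits$mismatch == 0 & hits$gapopen == 0,
                c("qseqid", "sseqid", "qstart", "qend", "sstart", "send")]
print(perfect)
```

Note that this only tells you the aligned region is identical. If the match has to cover the whole query, you would also need the query length, which is not among the default columns.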
