Question

Find binding site start and end positions in a genome using Biopython

5.1 years ago

PaSua • 0

Hello everyone, Python newbie here.

I'm currently working with transcription factor binding sites, and I have several binding-site sequences whose positions in the genome I don't know. For instance, I have the sequence "TGTAAACCTTTTCA", which belongs to NC_009004.1, and I was wondering if there is a way to find its position (start and end) using Biopython or any other approach.

Sorry if this is a simple or naive question, but I've been trying to solve it myself by checking videos, books and cookbooks, and so far I've got nothing.

Thank you in advance.

python biopython binding sites position • 1.9k views

ADD COMMENT • link updated 5.1 years ago by Biostar 20 • written 5.1 years ago by PaSua • 0

If your genome is not too long, you can try the pairwise alignment module from Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Otherwise, you'll have to align the sequence with dedicated alignment software such as BWA or Bowtie2.

Edit: Lactococcus lactis, 2.5 Mb.

You can give Biopython a try and check the running time.
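If an exact match is all that is needed, a plain string search over both strands may already be enough for a genome of this size. Below is a minimal sketch, assuming the genome is available as a Python string (for example via `str(SeqIO.read("NC_009004.1.fasta", "fasta").seq)` from Biopython, where the file name is hypothetical). `reverse_complement` and `find_sites` are illustrative helper names, not Biopython functions.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A/C/G/T only; other letters pass through)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_sites(genome, site):
    """Return sorted (start, end, strand) tuples for every exact occurrence.

    Coordinates are 1-based, inclusive, and always given on the forward strand.
    Overlapping occurrences are reported too.
    """
    genome = genome.upper()
    site = site.upper()
    hits = []
    for strand, pattern in (("+", site), ("-", reverse_complement(site))):
        pos = genome.find(pattern)
        while pos != -1:
            hits.append((pos + 1, pos + len(pattern), strand))
            # Restart one base further on so overlapping matches are not skipped.
            pos = genome.find(pattern, pos + 1)
    return sorted(hits)


# Toy example: the site once on the forward strand, once on the reverse strand.
toy_genome = "TGTAAACCTTTTCA" + "GGG" + "TGAAAAGGTTTACA"
print(find_sites(toy_genome, "TGTAAACCTTTTCA"))
# [(1, 14, '+'), (18, 31, '-')]
```

Note that a palindromic site would be reported once per strand at the same coordinates, so some de-duplication may be wanted in that case.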

ADD REPLY • link 5.1 years ago by Bastien Hervé 5.3k

Transcription factors bind to motifs, which can vary in nucleotide composition. It is better to use a dedicated tool such as FIMO from the MEME Suite for this. Your pattern-matching approach would require 100% sequence identity, which is simply not how transcription factor binding works.
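To get a rough feel for how much variation matters before setting up a motif scanner, one could relax the exact search to tolerate a few mismatches. The sketch below is only a Hamming-distance scan of the forward strand, not a substitute for a position weight matrix scan as done by FIMO. The function name `find_with_mismatches` and the `max_mismatches` parameter are made up for illustration.

```python
def find_with_mismatches(genome, site, max_mismatches=1):
    """Return (start, end, mismatches, matched) for every forward-strand window
    that differs from `site` at no more than `max_mismatches` positions.

    Coordinates are 1-based and inclusive.
    """
    genome = genome.upper()
    site = site.upper()
    k = len(site)
    hits = []
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        # Hamming distance between the window and the query site.
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((start + 1, start + k, mismatches, window))
    return hits


# Toy example: one perfect copy and one copy with a single substitution (C -> G).
toy = "AAA" + "TGTAAACCTTTTCA" + "GGG" + "TGTAAACGTTTTCA" + "AA"
for hit in find_with_mismatches(toy, "TGTAAACCTTTTCA", max_mismatches=1):
    print(hit)
# (4, 17, 0, 'TGTAAACCTTTTCA')
# (21, 34, 1, 'TGTAAACGTTTTCA')
```

The reverse strand can be covered by running the same scan with the reverse complement of the site. Every mismatch is weighted equally here, whereas a real motif model scores each position separately.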

ADD REPLY • link 5.1 years ago by ATpoint 82k

BLAST would be the obvious choice, but expect lots of hits so you'll have to do some filtering a posteriori.

(You needn't use BLAST via Python, but you could if you wanted to.)

ADD REPLY • link 5.1 years ago by Joe 21k

@PaSua, you can use seqkit locate.

Example input:

$ cat test.fa 
>a
TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA
>b
ATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA

Output:

$ seqkit locate -idp "TGTAAACCTTTTCA" test.fa

seqID  patternName     pattern         strand  start  end  matched
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       1      14   TGTAAACCTTTTCA
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       26     39   TGTAAACCTTTTCA
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       49     62   TGTAAACCTTTTCA
b      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       12     25   TGTAAACCTTTTCA

ADD REPLY • link 5.1 years ago by cpad0112 21k