Find binding site start and end positions in a genome using Biopython
19 months ago
PaSua • 0

Hello everyone, Python newbie here.

I'm currently working with transcription factor binding sites, and I have several binding site sequences whose positions in the genome I don't know. For instance, I have the sequence "TGTAAACCTTTTCA", which belongs to NC_009004.1, and I was wondering whether there is a way to find its position (start and end) using Biopython or any other approach.

Sorry if this is a simple or naive question, but I've been trying to solve it by myself checking some videos, books and cookbooks and so far I've got nothing.

Thank you in advance.

If your genome is not too long, you can try the pairwise alignment module from Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Otherwise you'll have to align the sequence with a dedicated aligner such as BWA or Bowtie2.

Edit: the genome is Lactococcus lactis, 2.5 M bases.

You can give Biopython a try and check the running time.
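
For an exact match, a full alignment is not strictly needed: a plain string search over both strands already gives the coordinates. Below is a minimal sketch, assuming the genome has already been read into a Python string (with Biopython, something like `str(SeqIO.read("NC_009004.1.fasta", "fasta").seq)`, where the file name is hypothetical). `reverse_complement` and `find_exact` are illustrative names, not Biopython functions. Coordinates are 1-based, inclusive, and reported on the forward strand.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact(genome, site):
    """Return (strand, start, end) for every exact occurrence of `site`.

    Positions are 1-based, inclusive, and given on the forward strand.
    Overlapping occurrences are reported too.
    """
    genome = genome.upper()
    site = site.upper()
    queries = [("+", site)]
    rc = reverse_complement(site)
    if rc != site:  # skip the duplicate search for palindromic sites
        queries.append(("-", rc))

    hits = []
    for strand, query in queries:
        pos = genome.find(query)
        while pos != -1:
            hits.append((strand, pos + 1, pos + len(query)))
            pos = genome.find(query, pos + 1)  # step by 1 to allow overlaps
    return sorted(hits, key=lambda h: h[1])


# Toy sequence (the real input would be the whole genome string).
toy = "ATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA"
print(find_exact(toy, "TGTAAACCTTTTCA"))
# [('+', 12, 25), ('+', 36, 49)]
```

On a 2.5 M base genome this kind of search should finish quickly, since `str.find` runs in compiled code.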

Transcription factors bind to motifs, which can vary somewhat in nucleotide composition. It is better to use a dedicated tool for this, such as FIMO from the MEME Suite. Your pattern-matching approach would require 100% sequence identity, which is simply not how transcription factor binding works.
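
To get a feel for why exact matching can miss real sites, one simple relaxation is to tolerate a few mismatches. The sketch below is not what FIMO does (FIMO scores each window against a position weight matrix). It is only a Hamming-distance scan of the forward strand, and `find_with_mismatches` and its `max_mismatches` parameter are names made up for this illustration.

```python
def find_with_mismatches(genome, site, max_mismatches=1):
    """Return (start, end, mismatches) for windows within the mismatch limit.

    Positions are 1-based and inclusive; only the given strand is scanned.
    """
    genome = genome.upper()
    site = site.upper()
    width = len(site)
    hits = []
    for i in range(len(genome) - width + 1):
        window = genome[i:i + width]
        # Hamming distance: count positions where the two strings differ.
        mismatches = sum(a != b for a, b in zip(window, site))
        if mismatches <= max_mismatches:
            hits.append((i + 1, i + width, mismatches))
    return hits


# Toy example: the second copy of the site carries one substitution (C -> G).
toy = "AATGTAAACCTTTTCAGGTGTAAAGCTTTTCATT"
for hit in find_with_mismatches(toy, "TGTAAACCTTTTCA", max_mismatches=1):
    print(hit)
```

With `max_mismatches=0` only the perfect copy at 3-16 is reported; allowing one mismatch also picks up the variant copy at 19-32. This pure-Python loop is slow on a whole genome, so it is meant as an illustration rather than a replacement for a motif scanner.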

BLAST would be the obvious choice, but expect lots of hits so you'll have to do some filtering a posteriori.

(You needn't use BLAST via Python, but you could if you wanted to.)

@PaSua, you can use seqkit locate.

Example input:

$ cat test.fa 
>a
TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA
>b
ATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA

output:

$ seqkit locate -idp "TGTAAACCTTTTCA" test.fa

seqID  patternName     pattern         strand  start  end  matched
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       1      14   TGTAAACCTTTTCA
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       26     39   TGTAAACCTTTTCA
a      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       49     62   TGTAAACCTTTTCA
b      TGTAAACCTTTTCA  TGTAAACCTTTTCA  +       12     25   TGTAAACCTTTTCA
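
For anyone who would rather stay in Python, a rough equivalent of the search above can be written with the standard `re` module. This is a sketch of the idea only, not how seqkit is implemented: it expands IUPAC degenerate codes into character classes, ignores case, and reports 1-based inclusive positions on the forward strand only. `IUPAC` and `locate` are names chosen here for illustration.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def locate(seq, pattern):
    """Yield (start, end, matched) for each forward-strand match."""
    body = "".join(IUPAC[base] for base in pattern.upper())
    # Wrapping the pattern in a lookahead lets overlapping matches be found.
    regex = re.compile("(?=(" + body + "))", re.IGNORECASE)
    for m in regex.finditer(seq):
        yield m.start() + 1, m.start() + len(pattern), m.group(1)


# Record "a" from test.fa above, copied as-is.
record_a = "TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA"
for hit in locate(record_a, "TGTAAACCTTTTCA"):
    print(*hit, sep="\t")
```

For record `a` this prints the same three start/end pairs as the seqkit output (1-14, 26-39, 49-62). To search the reverse strand as well, the same function could be run on the reverse complement of the pattern.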