Find binding site start and end positions in a genome using Biopython
19 months ago
PaSua • 0

Hello everyone, Python newbie here.

I'm currently working with transcription factor binding sites, and I have several binding-site sequences whose positions in the genome I don't know. For instance, I have the sequence "TGTAAACCTTTTCA", which belongs to NC_009004.1, and I was wondering whether there is a way to find its position (start and end) using Biopython or any other approach.

Sorry if this is a simple or naive question, but I've been trying to solve it myself by checking some videos, books and cookbooks, and so far I've got nothing.

If your genome is not too long, you can try the pairwise alignment module from Biopython: http://biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Otherwise you'll have to align this sequence using proper alignment software such as BWA or Bowtie2.

Edit: Lactococcus lactis, 2.5M bases.

You can give Biopython a try and check the running time.
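For a genome of this size, a plain exact-match scan may already be enough before reaching for an aligner. Below is a minimal sketch of that idea, not the pairwise alignment mentioned above: it searches both strands with ordinary string search and reports 1-based, inclusive coordinates. The names `find_sites`, `reverse_complement` and `load_genome` are made up for illustration, and `load_genome` assumes Biopython is installed and that the FASTA file holds a single record. The demo sequence is record `a` from the seqkit example further down the thread.

```python
def reverse_complement(seq):
    """Reverse complement of an A/C/G/T string (other letters pass through unchanged)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_sites(genome, site):
    """Return sorted (start, end, strand) tuples for exact matches.

    Coordinates are 1-based and inclusive, given on the forward strand.
    """
    genome = genome.upper()
    site = site.upper()
    queries = [("+", site)]
    rc = reverse_complement(site)
    if rc != site:  # skip the second scan for palindromic sites
        queries.append(("-", rc))

    hits = []
    for strand, query in queries:
        pos = genome.find(query)
        while pos != -1:
            hits.append((pos + 1, pos + len(query), strand))
            pos = genome.find(query, pos + 1)  # step by 1 so overlapping hits are kept
    return sorted(hits)


def load_genome(path):
    """Hypothetical helper: read a single-record FASTA file with Biopython."""
    from Bio import SeqIO  # imported here so the rest runs without Biopython
    return str(SeqIO.read(path, "fasta").seq)


site = "TGTAAACCTTTTCA"
seq_a = "TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA"
print(find_sites(seq_a, site))
# [(1, 14, '+'), (26, 39, '+'), (49, 62, '+')]
```

For the real genome, something like `find_sites(load_genome("NC_009004.1.fasta"), site)` should work, where the file name is only a placeholder for wherever the downloaded FASTA lives.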

Transcription factors bind to motifs, which can vary somewhat in nucleotide composition. It is better to use a dedicated tool such as FIMO from the MEME Suite for this. Your pattern-matching approach would require 100% sequence identity, which is simply not how transcription factor binding works.
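To illustrate the point about variation, here is a rough sketch of a search that tolerates a few mismatches. This is a simple stand-in and not what FIMO does (FIMO scores windows against a position weight matrix); it just slides a window along the forward strand and keeps windows within a chosen Hamming distance of the site. The function names and the `max_mismatches` parameter are hypothetical, and the toy genome is record `b` from the seqkit example with one base of the second site changed by hand.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_approx_sites(genome, site, max_mismatches=1):
    """Return (start, end, mismatches, matched) for forward-strand windows.

    Coordinates are 1-based and inclusive. Only substitutions are
    considered; insertions and deletions are not handled.
    """
    genome = genome.upper()
    site = site.upper()
    k = len(site)
    hits = []
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        d = hamming(window, site)
        if d <= max_mismatches:
            hits.append((i + 1, i + k, d, window))
    return hits


site = "TGTAAACCTTTTCA"
# Toy genome: an exact copy of the site plus a copy with one substitution (A -> T).
genome = "ATCGATGCGAT" + site + "ATGCGATGAC" + "TGTTAACCTTTTCA"

for hit in find_approx_sites(genome, site, max_mismatches=1):
    print(hit)
```

With `max_mismatches=0` this reduces to exact matching. The scan is O(genome length × site length), which should be tolerable for a few megabases but is far slower than a dedicated tool.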

BLAST would be the obvious choice, but expect lots of hits so you'll have to do some filtering a posteriori.

(You needn't use BLAST via python, but you could if you wanted to).

@PaSua, you can use seqkit locate.

Example input:

$ cat test.fa
>a
TGTAAACCTTTTCATACTEAAGATTTGTAAACCTTTTCATGACCGTAGTGTAAACCTTTTCA
>b
ATCGATGCGATTGTAAACCTTTTCAATGCGATGACTGTAAACCTTTTCA

output:

$ seqkit locate -idp "TGTAAACCTTTTCA" test.fa

seqID   patternName pattern strand  start   end matched
a   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   1   14  TGTAAACCTTTTCA
a   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   26  39  TGTAAACCTTTTCA
a   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   49  62  TGTAAACCTTTTCA
b   TGTAAACCTTTTCA  TGTAAACCTTTTCA  +   12  25  TGTAAACCTTTTCA