Question

Databases To Analyse DNA Characteristics/Patterns

1

12.0 years ago

PoGibas 5.1k

I have a collection of specific breakpoints and their flanking sequences:

<------150bp-----/breakpoint/-----150bp----->

I want to find out whether there is something in common among those sequences (I am trying to compare them and find some similarity).

I am interested in any database that can help decipher sequence patterns and characteristics (motifs, repeats, physical characteristics, etc.).

I have already tried:

MEME motif search;
Vienna package for secondary structures;
RepeatMasker;
Blast2 for similarity around the breakpoint;
EMBOSS tools;
UCSC tables for annotated info about repeats, histones, DNase sites, etc.

I am looking forward to any suggestions: TF binding, more secondary structures, more motifs, repeats and specific elements. Especially protein (chromatin remodeling), nucleases, Ig target sites!

All suggestions are welcome - I am going to try them all.

dna motif • 2.5k views

ADD COMMENT • link updated 12.0 years ago by Steve Lianoglou 5.2k • written 12.0 years ago by PoGibas 5.1k

0

I used the Vienna package to get the energetic parameters of my DNA. Secondary structures are cool (maybe way too good), but Vienna always returns some kind of structure, and what I am more interested in is a way to measure all possible DNA mechanical characteristics (flexing, bending, etc.). I hope someone can help me with an easy way of doing this, as Vienna was too "fancy".

ADD REPLY • link 12.0 years ago by PoGibas 5.1k

score 2 · Answer 1 · 2012-04-23

Assuming you have enough examples, how about trying to set this up as a classification problem?

If you're hunting for motifs, you can iterate over all x- to y-mers in the upstream and downstream flanks separately (where x and y are the min and max sizes of the k-mers you are looking for). These represent the features of your examples. (If you think other features are important, toss them in too.)
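A minimal sketch of that k-mer enumeration, assuming each example is just a pair of flank strings. The function name `kmer_features`, the `up:`/`down:` feature prefixes, and the default sizes are illustrative choices, not something fixed by the approach:

```python
from collections import Counter

def kmer_features(upstream, downstream, k_min=4, k_max=6):
    """Count every k-mer (k_min <= k <= k_max) in each flank separately."""
    features = Counter()
    for side, seq in (("up", upstream.upper()), ("down", downstream.upper())):
        for k in range(k_min, k_max + 1):
            for i in range(len(seq) - k + 1):
                kmer = seq[i:i + k]
                # Skip k-mers containing N or other ambiguity codes.
                if set(kmer) <= set("ACGT"):
                    features[f"{side}:{kmer}"] += 1
    return features

# Toy flanks; real ones would be the 150 bp on each side of a breakpoint.
example = kmer_features("ACGTACGT", "TTTTGGGG", k_min=4, k_max=4)
```

Keeping the upstream and downstream counts under separate prefixes lets a classifier learn that a k-mer matters on one side of the breakpoint only.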

Your breakpoint regions are your positive set. Pick an equally sized (or larger) set of regions at random (or regions that resemble your breakpoint regions in whatever way you think matters, but that contain no breakpoint), and this will be your negative set.
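One way to draw a random negative set is sketched below. It assumes you have chromosome lengths and breakpoint positions as plain Python data; the names `chrom_lengths`, `min_gap`, and the 1 kb exclusion distance are hypothetical and should be adapted to your data. It only returns window centers, so you would still need to pull the flanking sequence from the genome yourself.

```python
import random

def sample_negative_centers(chrom_lengths, breakpoints, n,
                            flank=150, min_gap=1000, seed=0):
    """Pick n random window centers at least min_gap bp from any known breakpoint.

    chrom_lengths: dict of chromosome name -> length
    breakpoints:   list of (chromosome, position) tuples
    """
    rng = random.Random(seed)
    by_chrom = {}
    for chrom, pos in breakpoints:
        by_chrom.setdefault(chrom, []).append(pos)
    # Only chromosomes long enough to hold a full window are eligible.
    chroms = [c for c, length in chrom_lengths.items() if length > 2 * flank]
    if not chroms:
        raise ValueError("no chromosome is long enough for the requested flank")
    weights = [chrom_lengths[c] for c in chroms]  # sample proportional to length
    negatives, attempts = [], 0
    while len(negatives) < n and attempts < 1000 * n:
        attempts += 1
        chrom = rng.choices(chroms, weights=weights)[0]
        center = rng.randint(flank, chrom_lengths[chrom] - flank)
        if all(abs(center - bp) >= min_gap for bp in by_chrom.get(chrom, [])):
            negatives.append((chrom, center))
    return negatives

# Toy genome and breakpoints, for illustration only.
toy_lengths = {"chr1": 100000, "chr2": 50000}
toy_breakpoints = [("chr1", 5000), ("chr2", 20000)]
negatives = sample_negative_centers(toy_lengths, toy_breakpoints, n=20)
```

Purely random windows are the easiest control; matching the negatives to the positives on GC content or repeat content is a stricter alternative if you suspect those properties would otherwise dominate.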

Run your data through some binary classifier (SVM, penalized logistic regression, boosting, etc.) with appropriate cross-validation and see if you can get good prediction accuracy. This will take some time to get right (assuming you can do so).

Once you can build a strong classifier, see if you can interrogate it to see which features are relevant.
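To make the last two steps concrete, here is a rough, dependency-free sketch of an L2-penalized logistic regression with k-fold cross-validation and a weight ranking. It is a toy under stated assumptions: examples are dicts of feature counts, the data below is synthetic (the `up:GATC` signal is planted on purpose), and the learning rate, penalty, and epoch count are arbitrary. In practice you would more likely reach for an established machine-learning library.

```python
import math
import random

def predict_proba(weights, bias, x):
    """Logistic model: probability that example x (feature -> value) is positive."""
    z = bias + sum(weights.get(f, 0.0) * v for f, v in x.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(examples, labels, epochs=200, lr=0.1, l2=0.01):
    """Fit L2-penalized logistic regression by batch gradient descent."""
    weights, bias = {}, 0.0
    n = len(examples)
    for _ in range(epochs):
        grad, grad_b = {}, 0.0
        for x, y in zip(examples, labels):
            err = predict_proba(weights, bias, x) - y
            grad_b += err
            for f, v in x.items():
                grad[f] = grad.get(f, 0.0) + err * v
        for f in set(grad) | set(weights):
            w = weights.get(f, 0.0)
            weights[f] = w - lr * (grad.get(f, 0.0) / n + l2 * w)
        bias -= lr * grad_b / n
    return weights, bias

def cross_val_accuracy(examples, labels, folds=5, seed=0):
    """Hold out each fold once, train on the rest, return overall accuracy."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for k in range(folds):
        test = set(idx[k::folds])
        train = [i for i in idx if i not in test]
        w, b = train_logreg([examples[i] for i in train],
                            [labels[i] for i in train])
        correct += sum((predict_proba(w, b, examples[i]) >= 0.5) == bool(labels[i])
                       for i in test)
    return correct / len(examples)

def top_features(weights, n=10):
    """Rank features by absolute weight as a first look at what the model uses."""
    return sorted(weights.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Synthetic data: positives carry a planted k-mer, the other two features are noise.
rng = random.Random(1)
examples, labels = [], []
for i in range(40):
    y = i % 2
    x = {"up:AAAA": rng.randint(0, 1), "down:CCCC": rng.randint(0, 1)}
    if y:
        x["up:GATC"] = 3
    examples.append(x)
    labels.append(y)

accuracy = cross_val_accuracy(examples, labels)
weights, bias = train_logreg(examples, labels)
ranked = top_features(weights, n=3)
```

With real k-mer counts the feature space is much larger, so a sparsity-inducing (L1) penalty is often easier to interpret than ranking L2 weights; treat large weights as hypotheses to check, not as proof of a motif.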

score 1 · Answer 2 · 2012-04-23

12.0 years ago

Philipp Bayer 8.3k

Have you had a look at protein domain analysis using HMMER?

You could translate your sequences in all 6 possible reading frames into amino acids and then check for protein domains. The domain models are profile hidden Markov models, so you might get different results from what you've already tried.
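A small sketch of the six-frame translation, assuming the standard genetic code and plain A/C/G/T input. The function names and the `+1` to `-3` frame labels are illustrative; codons containing any other character come out as `X`, and stop codons as `*`:

```python
import itertools

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: aa
    for (a, b, c), aa in zip(itertools.product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def six_frame_translations(seq):
    """Return a dict of frame label ('+1'..'+3', '-1'..'-3') -> peptide string."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

frames = six_frame_translations("ATGGCCTAA")
```

The peptides could then be written to a FASTA file and searched against a profile database with HMMER; you may want to split them at stop codons first, since a 300 bp window gives at most about 100 residues per frame.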

ADD COMMENT • link 12.0 years ago by Philipp Bayer 8.3k

2

I'm not sure he specified that these were breakpoints in coding regions(?)

ADD REPLY • link 12.0 years ago by Steve Lianoglou 5.2k

0

True! This should only work on coding sequences.

ADD REPLY • link 12.0 years ago by Philipp Bayer 8.3k