How To Score The Repetitiveness Of A DNA Sequence?
3
6
13.0 years ago

Hi,

Is there any tool or web service to score for the level of repetitiveness of a DNA sequence? Maybe something based on a repeat database or something like that.

[edit] For example, is there any web service where I could quickly input my short sequence and get a repetitiveness score calculated against RepBase?

Cheers

repeats dna sequence • 4.9k views
5

We could help more if you provided more context to your question: What genome? What are you trying to accomplish? Maybe you are looking for mapability scores: http://hgwdev.cse.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeMapability#TRACK_HTML

4
13.0 years ago

It depends on the species and the type of repeat you want to find. Look at the reports from RepeatMasker run against the RepBase database. DUST and SEG can also be used if you are only interested in low-complexity sequence.

Note that if you are using a model organism, Ensembl has most of this precalculated.

2
13.0 years ago
brentp 24k

@Brad and @Alastair give more traditional answers, but you can also check for sequence simplicity/repetitiveness using compression.

Andrew Dalke has a writeup on doing something similar for molecular structures here, which I always thought was a cool idea for its simplicity. It could translate (so to speak) to sequence data quite easily. The idea is that simple sequences will be more compressible.

You could do this in a moving window or blockwise fashion to get an idea of the repetitiveness by taking the ratio of original-sequence-length to compressed-sequence-length; more compressible == more repetitive.
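The moving-window idea could be sketched roughly as below. This is only an illustration, not code from Andrew Dalke's writeup: the function names, the choice of zlib as the compressor, and the window and step sizes are all my own assumptions.

```python
import zlib


def compression_ratio(seq: str) -> float:
    """Original length divided by zlib-compressed length.

    Higher values mean the sequence is more compressible, which is
    used here as a rough proxy for repetitiveness.
    """
    data = seq.upper().encode("ascii")
    if not data:
        return 0.0
    return len(data) / len(zlib.compress(data, 9))


def windowed_ratios(seq: str, window: int = 200, step: int = 100):
    """Yield (start, ratio) for each window along the sequence.

    The window and step defaults are arbitrary illustration values.
    A sequence shorter than the window yields a single ratio for the
    whole sequence.
    """
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        yield start, compression_ratio(seq[start:start + window])


if __name__ == "__main__":
    example = "ACGTTGCAAGCT" * 10 + "AC" * 200
    for start, ratio in windowed_ratios(example):
        print(start, round(ratio, 2))
```

The compressor adds a fixed overhead, so ratios from very short windows are compressed toward small values and are only comparable between windows of the same length. Note also that any DNA string compresses somewhat just because it uses a four-letter alphabet, so the ratio is most useful relative to a non-repetitive baseline.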

1
13.0 years ago

As Alastair suggests, I too would use RepeatMasker. I did that often and used the reports to state the percentage of bases in the query masked by certain types of repeats, including low-complexity and complex repeats. I do not recall the lower and upper bounds on input length, but I never ran up against them. So, you could create a scoring mechanism based on the different types of repeats as well as the total number of bases masked (reported as repeats or low-complexity sequence).
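As a rough illustration of the "total bases masked" part of such a score, the sketch below counts masked positions in an already masked sequence. It assumes masking is shown either as lowercase letters (soft-masking, for example from RepeatMasker's `-xsmall` option) or as N/X characters (hard-masking). The function name is hypothetical, and it does not break the score down by repeat type, which would need the RepeatMasker report itself.

```python
def masked_fraction(masked_seq: str) -> float:
    """Fraction of bases that appear masked.

    Assumption: lowercase letters are soft-masked bases and N/X are
    hard-masked bases. Any N already present in the unmasked input
    will also be counted as masked.
    """
    seq = "".join(masked_seq.split())  # drop whitespace and newlines
    if not seq:
        return 0.0
    masked = sum(1 for base in seq if base.islower() or base in "NX")
    return masked / len(seq)


if __name__ == "__main__":
    # 4 unmasked bases, 4 soft-masked, 4 hard-masked
    print(masked_fraction("ACGTacgtNNNN"))
```

Multiply by 100 to get the percentage of the query that was masked.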
