Question

How To Score For The Repetitiveness Of A DNA Sequence?

6

13.0 years ago

2184687-1231-83- ★ 5.1k

Hi,

Is there any tool or web service to score for the level of repetitiveness of a DNA sequence? Maybe something based on a repeat database or something like that.

[edit] For example, is there any web service where I could quickly input my short sequence and get a repetitiveness score calculated against RepBase?

Cheers

repeats dna sequence • 4.9k views

updated 12.7 years ago by Larry_Parnell 16k • written 13.0 years ago by 2184687-1231-83- ★ 5.1k

5

We could help more if you gave more context for your question: What genome? What are you trying to accomplish? Maybe you are looking for mappability scores: http://hgwdev.cse.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeMapability#TRACK_HTML

updated 4.6 years ago by Ram 43k • written 13.0 years ago by Brad Chapman 9.7k

score 4 · Answer 1 · 2011-04-26

It depends on the species and the type of repeat you want to find. Look at the reports from RepeatMasker run against the RepBase database. DUST and SEG can also be used if you are only interested in low-complexity sequence.

Note that if you are using a model organism, Ensembl has most of this precalculated.
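For a quick feel of what a low-complexity score looks like, here is a minimal sketch of a triplet-count score in the spirit of DUST. It is not the code of the DUST program itself, and the function name `triplet_score` is made up for illustration. It counts overlapping triplets and rewards triplets that occur many times, so a higher value suggests lower complexity.

```python
from collections import Counter


def triplet_score(seq: str, k: int = 3) -> float:
    """DUST-style low-complexity score (illustrative sketch, not the real tool).

    Sums c * (c - 1) / 2 over the counts c of each overlapping k-mer,
    then divides by (number of k-mers - 1).
    """
    s = seq.upper()
    kmers = [s[i:i + k] for i in range(len(s) - k + 1)]
    if len(kmers) < 2:
        # Too short to say anything about repetition.
        return 0.0
    counts = Counter(kmers)
    repeated_pairs = sum(c * (c - 1) / 2 for c in counts.values())
    return repeated_pairs / (len(kmers) - 1)


if __name__ == "__main__":
    print(triplet_score("AAAAAAAAAA"))  # homopolymer, high score
    print(triplet_score("ACGTTGCA"))    # all triplets distinct, score 0
```

Any cutoff for calling a window "low complexity" would be your own choice here; the real tools have their own defaults and windowing.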

score 2 · Answer 2 · 2011-04-26

@Brad and @Alastair give the more traditional answers, but you can also check for sequence simplicity/repetitiveness using compression.

Andrew Dalke has a writeup on doing something similar for molecular structures here that I always thought was a cool idea for its simplicity. It could translate (so to speak) to sequence data quite easily. The idea is that simple sequences will be more compressible.

You could do this in a moving-window or blockwise fashion to get an idea of the repetitiveness by taking the ratio of original-sequence-length to compressed-sequence-length; a higher ratio means more compressible, which means more repetitive.
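A minimal sketch of the compression idea, assuming zlib as the compressor (any general-purpose compressor would do). The function names and the default window and step sizes are my own choices for illustration, not taken from any existing tool.

```python
import zlib


def compression_ratio(seq: str) -> float:
    """Original length divided by zlib-compressed length.

    Higher values mean the sequence compresses better, i.e. looks more repetitive.
    """
    data = seq.upper().encode("ascii")
    if not data:
        raise ValueError("empty sequence")
    return len(data) / len(zlib.compress(data, 9))


def windowed_ratios(seq: str, window: int = 200, step: int = 100):
    """Return (start, ratio) pairs for a sliding window over the sequence."""
    if len(seq) <= window:
        return [(0, compression_ratio(seq))]
    return [
        (start, compression_ratio(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    print(compression_ratio("AT" * 500))
    for start, ratio in windowed_ratios("ACGT" * 100):
        print(start, round(ratio, 2))
```

Keep in mind that the compressor adds a fixed overhead, so ratios for very short sequences are not comparable with those for long ones. Compare windows of the same length, or compare against a shuffled copy of the same sequence as a baseline.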

score 1 · Answer 3 · 2011-04-26

As Alastair suggests, I too would use RepeatMasker. I did that often and used the reports to state the percentage of bases in the query masked by certain types of repeats, including low-complexity and complex repeats. I do not recall the lower and upper bounds of length for the input, but I never ran up against those. So, you could create a scoring mechanism based on the different types of repeats as well as the total number of bases masked (represented as repeats or low-complexity sequence).
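The simplest version of such a score is the fraction of masked bases. Here is a rough sketch that works on the masked sequence itself rather than on the report, assuming masking shows up as N or X (hard masking) or as lowercase letters (soft masking, e.g. RepeatMasker's -xsmall option). The function name `masked_fraction` is hypothetical, and it does not distinguish repeat types; for that you would need to parse the report.

```python
def masked_fraction(masked_seq: str) -> float:
    """Fraction of bases that are masked in a RepeatMasker-style masked sequence.

    Assumption: masked bases are lowercase (soft-masked) or N/X (hard-masked).
    Ns already present in the unmasked input will be counted as masked too.
    """
    bases = [c for c in masked_seq if c.isalpha()]  # ignore whitespace, digits
    if not bases:
        raise ValueError("no sequence letters found")
    masked = sum(1 for c in bases if c.islower() or c in "NX")
    return masked / len(bases)


if __name__ == "__main__":
    print(masked_fraction("ACGTacgtNNNN"))  # 8 of 12 bases masked
```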