Question

How Can I Maintain Feature Counts When Removing Sequences Within Their Boundaries?

1

Entering edit mode

13.0 years ago

Steve Moss 2.3k

I'm currently using integer spans (Set::IntSpan::Fast) as a fast (or the fastest way I can think of) to add start and end co-ordinates for particular genomic co-ordinates to hashes for each gene ID for whole genomes using Perl. This then represents the individual ranges and thus sizes of multiple features (e.g. exons) within the genes.

I also do a similar thing for any repetitive sequence ranges (within the original features), but instead remove this from those original integer spans for each gene (hash), in order to get the unique feature lengths, minus the repetitive sequence (e.g. TEs).

However, this currently creates a bias towards more, smaller gene features, even though the overall feature sequence length is correct. This is because if a sequence is removed from the middle of one of the features, then it will return two smaller segments, even though, what I really want to do, is treat it as one segment still, but use it's smaller length - to build a new frequency distribution of unique sizes!

E.g.

# add three features of 20, 30 & 40bp respectively
$hash{$gene_id}->add_ranges("1-20", "51-80", "111-150")

# remove 3 repetitive elements of 5, 10 and 20bp
$hash{$gene_id}->del_ranges("11-15", "61-70", "121-140")

This results in 6 ranges of 1-10, 16-20, 51-60, 71-80, 111-120, and 141-150. Where as the overall resulting sequence length is correct, the number of features can't increase, but must stay the same (or reduce if the feature is all repetitive). So I need a way to join the fragments back up, relative to the original ranges.

Can you think of the most efficient way I can do this? Perhaps check the integer span boundaries before I remove the sequence and adjust the removal to the boundary of the feature instead (I don't need to know about start or end coordinates after, just the new unique size and frequency of the original features minus the repeat sequence).

perl genome programming gene repeats • 2.4k views

ADD COMMENT • link updated 13.0 years ago by brentp 24k • written 13.0 years ago by Steve Moss 2.3k

score 2 · Answer 1 · 2011-05-14

I think if you're going gene-by-gene, you can just use perl arrays and slice out the repetitive sequence. Here's a vanilla perl function that does that given a gene length, an array-ref of exons and an array-ref of ranges to delete:

use strict;

sub del_ranges {
    my $gene_length = shift;
    my $ranges = shift;
    my $dels = shift;
    my @ranges = @$ranges;
    my @dels = @$dels;

    # create a gene of all zeros.
    my @gene = map { 0 } 0 .. $gene_length;
    # put 1's for the ranges.
    map { $gene[$_] = 1 } @ranges;

    # delete out the gaps.
    foreach my $del (@dels){
        my $start = $del->[0];
        my $len = scalar(@$del);
        splice(@gene, $start, $len);
    }

    # starts are transitions from 0 to 1
    my @starts = grep { $gene[$_-1] == 0 && $gene[$_] == 1 } 1..$gene_length;
    # ends are transitions from 1 to 0
    my @ends   = grep { $gene[$_+1] == 0 && $gene[$_] == 1 } 1..$gene_length;

    # pair up as ([start, end], ...)
    my @new_gene = map { [$starts[$_], $ends[$_]] } 0..$#ends;
    return \@new_gene;
}

my $gene_length = 150;
my @ranges = (1..20, 51..80, 111..150);
my @dels = reverse ([11..15], [61..70], [121..140]);

my $del_gene = del_ranges($gene_length, \@ranges, \@dels);
map { print $_->[0] . "-" . $_->[1] . "\n" } @$del_gene;

This will print out:

1-15
46-65
96-115