How Can I Maintain Feature Counts When Removing Sequences Within Their Boundaries?
1
1
Entering edit mode
13.0 years ago

I'm currently using integer spans (Set::IntSpan::Fast) as a fast (or the fastest way I can think of) to add start and end co-ordinates for particular genomic co-ordinates to hashes for each gene ID for whole genomes using Perl. This then represents the individual ranges and thus sizes of multiple features (e.g. exons) within the genes.

I also do a similar thing for any repetitive sequence ranges (within the original features), but instead remove this from those original integer spans for each gene (hash), in order to get the unique feature lengths, minus the repetitive sequence (e.g. TEs).

However, this currently creates a bias towards more, smaller gene features, even though the overall feature sequence length is correct. This is because if a sequence is removed from the middle of one of the features, then it will return two smaller segments, even though, what I really want to do, is treat it as one segment still, but use it's smaller length - to build a new frequency distribution of unique sizes!

E.g.

# add three features of 20, 30 & 40bp respectively
$hash{$gene_id}->add_ranges("1-20", "51-80", "111-150")

# remove 3 repetitive elements of 5, 10 and 20bp
$hash{$gene_id}->del_ranges("11-15", "61-70", "121-140")

This results in 6 ranges of 1-10, 16-20, 51-60, 71-80, 111-120, and 141-150. Where as the overall resulting sequence length is correct, the number of features can't increase, but must stay the same (or reduce if the feature is all repetitive). So I need a way to join the fragments back up, relative to the original ranges.

Can you think of the most efficient way I can do this? Perhaps check the integer span boundaries before I remove the sequence and adjust the removal to the boundary of the feature instead (I don't need to know about start or end coordinates after, just the new unique size and frequency of the original features minus the repeat sequence).

perl genome programming gene repeats • 2.4k views
ADD COMMENT
2
Entering edit mode
13.0 years ago
brentp 24k

I think if you're going gene-by-gene, you can just use perl arrays and slice out the repetitive sequence. Here's a vanilla perl function that does that given a gene length, an array-ref of exons and an array-ref of ranges to delete:

use strict;

sub del_ranges {
    my $gene_length = shift;
    my $ranges = shift;
    my $dels = shift;
    my @ranges = @$ranges;
    my @dels = @$dels;

    # create a gene of all zeros.
    my @gene = map { 0 } 0 .. $gene_length;
    # put 1's for the ranges.
    map { $gene[$_] = 1 } @ranges;

    # delete out the gaps.
    foreach my $del (@dels){
        my $start = $del->[0];
        my $len = scalar(@$del);
        splice(@gene, $start, $len);
    }

    # starts are transitions from 0 to 1
    my @starts = grep { $gene[$_-1] == 0 && $gene[$_] == 1 } 1..$gene_length;
    # ends are transitions from 1 to 0
    my @ends   = grep { $gene[$_+1] == 0 && $gene[$_] == 1 } 1..$gene_length;

    # pair up as ([start, end], ...)
    my @new_gene = map { [$starts[$_], $ends[$_]] } 0..$#ends;
    return \@new_gene;
}

my $gene_length = 150;
my @ranges = (1..20, 51..80, 111..150);
my @dels = reverse ([11..15], [61..70], [121..140]);

my $del_gene = del_ranges($gene_length, \@ranges, \@dels);
map { print $_->[0] . "-" . $_->[1] . "\n" } @$del_gene;

This will print out:

1-15
46-65
96-115
ADD COMMENT
0
Entering edit mode

Thanks brentp :-) I'll give this a whirl whe I get back in the lab on Monday!

ADD REPLY
0
Entering edit mode

It seems to be truncating the code for some reason, although it may be an issue with mobile Safari?

ADD REPLY
0
Entering edit mode

looks ok to me. last line should be: map { print $->[0] . "-" . $->[1] . "n" } @$del_gene;

ADD REPLY

Login before adding your answer.

Traffic: 1961 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6