I checked the code from Meg Bowman, and apparently, at least in my case, the problem had to do with whitespace in the FASTA deflines.
The original script contains these lines:
my $repid = $seqobj1->display_id();
my $repdesc = $seqobj1->desc();
my $new_desc = $repdesc;
$new_desc =~ s/ /_/g;
my $new_id = "$repid" . "_" . "$new_desc";
my $blank_desc = " ";
$repdesc = $seqobj1->desc($blank_desc);
$repid = $seqobj1->display_id($new_id);
$seqout->write_seq($seqobj1);
and then the script parsed $repid to construct $silly_index and $ltr_start using:
my ($silly_index, $ltr_start, $seq_id, $remid, $artificial_key, %artificial_key_hash, $i2, $key3);
my $removed = Bio::SeqIO->new (-format => 'fasta', -file => $removed_repeats);
while (my $seqobj2 = $removed->next_seq()) {
$remid = $seqobj2->display_id();
if ($remid =~ /^(.+)_\(/) {
$seq_id = $1;
}
else {
}
if ($remid =~ /\(dbseq-nr_(\d+)\)_\[/) {
$silly_index = $1;
}
if ($remid =~ /\[(\d+),/) {
$ltr_start = $1;
}
If you replace the first group of lines with
my $repid = $seqobj1->display_id();
my $repdesc = $seqobj1->desc();
my $new_desc = $repdesc;
$new_desc =~ s/.*\(dbseq-nr //g;
$new_desc =~ s/\) \[/ /g;
$new_desc =~ s/\,.*//g;
$repdesc = $seqobj1->desc($new_desc);
$repid = $seqobj1->display_id($repid);
$seqout->write_seq($seqobj1);
and the second group of lines with
my ($silly_index, $ltr_start, $seq_id, $remid, $remdesc, $artificial_key, %artificial_key_hash, $i2, $key3);
my $removed = Bio::SeqIO->new (-format => 'fasta', -file => $removed_repeats);
while (my $seqobj2 = $removed->next_seq()) {
$remid = $seqobj2->display_id();
$remdesc = $seqobj2->desc();
my @remdesc_arr = split " ", $remdesc;
$seq_id = $remid;
$silly_index = $remdesc_arr[0];
$ltr_start = $remdesc_arr[1];
It should work fine. I'm no Perl expert and there is probably a better fix, but in the meantime this patch works. In my case there are 9 repeats; after step 2 I have 19 additional files: 1 with the filtered repeat sequences, 9 with the upstream sequences, and 9 with the downstream sequences. Hope it helps.
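To make it easier to see what the patched substitutions do, here is a small standalone sketch that runs them on a single description string without needing BioPerl. The sample description is my assumption about what the defline looks like (I inferred the "(dbseq-nr N) [start,end]" shape from the regexes in the script); in the real script the string comes from $seqobj1->desc().

```perl
use strict;
use warnings;

# Hypothetical description string (assumed format, inferred from the regexes
# in the script); in the real script this comes from $seqobj1->desc().
my $repdesc = "(dbseq-nr 3) [4521,10233]";

my $new_desc = $repdesc;
$new_desc =~ s/.*\(dbseq-nr //g;   # drop everything up to and including "(dbseq-nr "
$new_desc =~ s/\) \[/ /g;          # turn ") [" into a single space
$new_desc =~ s/\,.*//g;            # drop the comma and everything after it

# The description is now "index start", so splitting on whitespace
# gives the two values the later loop needs.
my ($silly_index, $ltr_start) = split " ", $new_desc;

print "silly_index=$silly_index ltr_start=$ltr_start\n";
# prints: silly_index=3 ltr_start=4521
```

If your deflines have a different layout, these three substitutions will probably need adjusting before the split gives the right two fields.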