LTR-Harvest false positives (Maker repeat library construction)
2
0
Entering edit mode
6.4 years ago
timo.metz • 0

Hey there,

currently I want to to a genome annotation using maker. For the repeats I followed the following protocol:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced

In there they use LTR-Harvest to find LTRs and then want to remove false positives. For this they provide perl scripts. I came up to point 2.1.3 where they run CRL-Step2 and should get as output two different files. In the one file, they want to have the upstream or downstream 50bp flanking sequences of LTRs having gaps. However, I do not get this file. Has anybody encountered this problem before?

If not, is there another way to filter out false positives introduced by LTR-Harvest?

kind regards

annotation maker ltr-harvest • 2.0k views
ADD COMMENT
0
Entering edit mode
5.5 years ago

Hello there, I understand this is an old thread, but i was wondering if you managed to solve this problem? I face the same problem with a list of error and no flanking sequence file.

Thank you.

ADD COMMENT
0
Entering edit mode
4.9 years ago
vflorelo • 0

I checked the code from Meg Bowman, and apparently, at least in my case it had to do with whitespaces in the fasta deflines. In the original script there were these lines

my $repid = $seqobj1->display_id();
my $repdesc = $seqobj1->desc();
my $new_desc = $repdesc; 
$new_desc =~ s/ /_/g;
my $new_id = "$repid" . "_" . "$new_desc";
my $blank_desc = " "; 
$repdesc = $seqobj1->desc($blank_desc);
$repid = $seqobj1->display_id($new_id);
$seqout->write_seq($seqobj1);

and then the script parsed $repid to construct $silly_index and $ltr_start using

my ($silly_index, $ltr_start, $seq_id, $remid, $artificial_key, %artificial_key_hash, $i2, $key3);
my $removed = Bio::SeqIO->new (-format => 'fasta', -file => $removed_repeats);
while (my $seqobj2 = $removed->next_seq()) {
      $remid = $seqobj2->display_id();
      if ($remid =~ /^(.+)_\(/) {
        $seq_id = $1;
      }
      else {
      }
      if ($remid =~ /\(dbseq-nr_(\d+)\)_\[/) {
        $silly_index = $1;
      }
      if ($remid =~ /\[(\d+),/) {
        $ltr_start = $1;
      }

If you replace the first group of lines lines with

my $repid      = $seqobj1->display_id();
my $repdesc    = $seqobj1->desc();
my $new_desc   = $repdesc;
$new_desc      =~ s/.*\(dbseq-nr //g;
$new_desc      =~ s/\) \[/ /g;
$new_desc      =~ s/\,.*//g;
$repdesc       = $seqobj1->desc($new_desc);
$repid         = $seqobj1->display_id($repid);
$seqout->write_seq($seqobj1);

and the second group of lines with

my ($silly_index, $ltr_start, $seq_id, $remid, $remdesc, $artificial_key, %artificial_key_hash, $i2, $key3);
my $removed = Bio::SeqIO->new (-format => 'fasta', -file => $removed_repeats);
while (my $seqobj2 = $removed->next_seq()) {
      $remid   = $seqobj2->display_id();
      $remdesc = $seqobj2->desc();
      my @remdesc_arr = split " ", $remdesc;
      $seq_id = $remid;
      $silly_index = $remdesc_arr[0];
      $ltr_start = $remdesc_arr[1];

It should work fine. I'm no perl expert and there should be a better fix, but in the meantime this patch works. In my case there are 9 repeats, after step2 I have 19 additional files 1 with the filtered repeat sequences, 9 with the upstream sequences and 9 with the downstream sequences. Hope it helps

ADD COMMENT

Login before adding your answer.

Traffic: 1593 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6