Question

LTR-Harvest false positives (Maker repeat library construction)

0

Entering edit mode

6.4 years ago

timo.metz • 0

Hey there,

currently I want to to a genome annotation using maker. For the repeats I followed the following protocol:

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/Repeat_Library_Construction-Advanced

In there they use LTR-Harvest to find LTRs and then want to remove false positives. For this they provide perl scripts. I came up to point 2.1.3 where they run CRL-Step2 and should get as output two different files. In the one file, they want to have the upstream or downstream 50bp flanking sequences of LTRs having gaps. However, I do not get this file. Has anybody encountered this problem before?

If not, is there another way to filter out false positives introduced by LTR-Harvest?

kind regards

annotation maker ltr-harvest • 2.0k views

ADD COMMENT • link updated 4.9 years ago by vflorelo • 0 • written 6.4 years ago by timo.metz • 0

score 0 · Answer 1 · 2018-10-04

0

Entering edit mode

5.5 years ago

prateekshettys • 0

Hello there, I understand this is an old thread, but i was wondering if you managed to solve this problem? I face the same problem with a list of error and no flanking sequence file.

Thank you.

ADD COMMENT • link 5.5 years ago by prateekshettys • 0

score 0 · Answer 2 · 2019-05-13

I checked the code from Meg Bowman, and apparently, at least in my case it had to do with whitespaces in the fasta deflines. In the original script there were these lines

my $repid = $seqobj1->display_id();
my $repdesc = $seqobj1->desc();
my $new_desc = $repdesc; 
$new_desc =~ s/ /_/g;
my $new_id = "$repid" . "_" . "$new_desc";
my $blank_desc = " "; 
$repdesc = $seqobj1->desc($blank_desc);
$repid = $seqobj1->display_id($new_id);
$seqout->write_seq($seqobj1);

and then the script parsed $repid to construct $silly_index and $ltr_start using

my ($silly_index, $ltr_start, $seq_id, $remid, $artificial_key, %artificial_key_hash, $i2, $key3);
my $removed = Bio::SeqIO->new (-format => 'fasta', -file => $removed_repeats);
while (my $seqobj2 = $removed->next_seq()) {
      $remid = $seqobj2->display_id();
      if ($remid =~ /^(.+)_\(/) {
        $seq_id = $1;
      }
      else {
      }
      if ($remid =~ /\(dbseq-nr_(\d+)\)_\[/) {
        $silly_index = $1;
      }
      if ($remid =~ /\[(\d+),/) {
        $ltr_start = $1;
      }

If you replace the first group of lines lines with

my $repid      = $seqobj1->display_id();
my $repdesc    = $seqobj1->desc();
my $new_desc   = $repdesc;
$new_desc      =~ s/.*\(dbseq-nr //g;
$new_desc      =~ s/\) \[/ /g;
$new_desc      =~ s/\,.*//g;
$repdesc       = $seqobj1->desc($new_desc);
$repid         = $seqobj1->display_id($repid);
$seqout->write_seq($seqobj1);

and the second group of lines with

my ($silly_index, $ltr_start, $seq_id, $remid, $remdesc, $artificial_key, %artificial_key_hash, $i2, $key3);
my $removed = Bio::SeqIO->new (-format => 'fasta', -file => $removed_repeats);
while (my $seqobj2 = $removed->next_seq()) {
      $remid   = $seqobj2->display_id();
      $remdesc = $seqobj2->desc();
      my @remdesc_arr = split " ", $remdesc;
      $seq_id = $remid;
      $silly_index = $remdesc_arr[0];
      $ltr_start = $remdesc_arr[1];

It should work fine. I'm no perl expert and there should be a better fix, but in the meantime this patch works. In my case there are 9 repeats, after step2 I have 19 additional files 1 with the filtered repeat sequences, 9 with the upstream sequences and 9 with the downstream sequences. Hope it helps