Capillary sequence trimming algorithm
0
1
Entering edit mode
5.4 years ago
tospo ▴ 50

I need to implement a quality trimming algorithm for capillary sequence reads (yes, they still exist!) in a Perl package of my own, so I am trying to implement the algorithm used by the PHRED software as well as other tools. There is a description of the Mott algorithm here: https://www.codoncode.com/support/faqs/trimming.htm (under "More about trimming") and in the PHRED manual (see --trim_alt) here: http://www.phrap.org/phredphrap/phred.html

There is an implementation of the algorithm in BioPyhton here: http://biopython.org/DIST/docs/api/Bio.SeqIO.AbiIO-pysrc.html#_abi_trim

But that confuses me: while the original algorithm tries to identify the highest scoring sub-sequence, the BioPython algorithm starts the trimmed sequence at the first base that reaches the quality threshold. That seems strange to me because that could well be followed by a run of bases with very low quality, which should be trimmed.

I have interpreted the Mott algorithm as follows in Perl, which seems to produce good results when I manually check the trimming of some sequence reads I have got. But I am slightly worried that I am missing something because the BioPhyton interpretation is so different. Another thing that I don't get is that the description of the original algorithm says "the score can have non-negative values only" but if I set all negative scores to zero, there is no penalty for low quality scores anymore or am I missing something?

sub quality_trimmed_seq {
  my $seq = shift or croak("need a Bio::Seq object");
  if ( !blessed( $seq ) or !$seq->isa('Bio::Seq::SequenceTrace') ){
    croak("input to quality_trimmed_seq must be a 'Bio::Seq::SequenceTrace object, not a".
      (ref $seq||'scalar')
    )
  }
  my $cutoff = shift || 0.05;
  my $segment_len = shift || 20;  # minimum length of a segment;
  die "segment_len must be > 1" if $segment_len < 1;

  # if the sequence is too short, return undef
  return undef if $seq->length < $segment_len;

  # transform the Q values into error probabilities
  my @scores;
  foreach my $qual ( @{ $seq->qual } ){
    my $score = $cutoff - ( 10 ** ( $qual / -10.0 ) ) ;
    push @scores, $score;
  }

  # traverse all possible sub-segments of the sequence
  # and calculate a sum score
  my $max_score = 0;
  my $trim_start = 1;
  my $trim_end = $seq->length;
  foreach my $start ( 1 .. $seq->length - 1 ){
    my $first_end = $start + $segment_len - 1;
    next if $first_end > $seq->length;
    foreach my $end ( $first_end .. $seq->length){
      my $score_sum = 0;
      foreach my $base_i ( $start - 1 .. $end - 1  ){
        $score_sum += $scores[$base_i];
        if ( $score_sum > $max_score ){
          $max_score = $score_sum;
          $trim_start = $start;
          $trim_end = $end;
        }
      }
    }
  }
  if ( $max_score > 0 ){
    return $seq->sub_trace_object( $trim_start, $trim_end);
  } else {
    return undef;
  }
}
sequencing capillary • 914 views
ADD COMMENT

Login before adding your answer.

Traffic: 2580 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6