working with Sanger fasta + qual + clip files
1
0
Entering edit mode
5.1 years ago
sullis02 ▴ 40

I have an old Sanger fasta file and its associated qual file, as well as a 'clip' file that shows where to trim low-quality sequence for each fasta sequence. e.g. for the first record in the fasta file here are the corresponding data in each file:

fasta:

>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACT
CACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTT
CTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTC
AACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATT
GATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAG
GAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTC
GGTTTTGAACCTCGGACCAAGTGAATCC

qual:

>gnl|ti|713904976 1047095293492
  9  9  9  8  6  9  6  7 10 10 10 10  9  9  7  9  7 10 10  9
  9 10  9  9  9  9  9  9 12 12 10  9  9  9  9 12 21 12 13 17
 20 21 20 21 28 25 24 30 21 25 21 21 23 23 31 28 25 28 23 23
 29 23 44 33 33 29 34 34 26 29 32 32 32 32 40 40 38 29 45 35
 28 23 29 29 22 22 22 26 32 29 32 29 29 29 30 30 33 38 33 25
 25 35 20 20 22 20 20 29 29 27 24 22 30 24 27 24 34 25 34 38
 29 31 29 23 24 24 23 32 32 40 39 39 27 19 21 14 22 22 22 45
 40 26 44 38 28 28 26 40 32 34 36 40 36 40 39 27 21 23 21 39
 39 32 28 32 40 26 32 26 25 35 34 32 40 40 28 36 40 38 39 39
 36 32 29 28 28 28 28 29 45 40 36 40 34 29 32 24 18 18 23 27
 32 36 47 47 38 33 25 30 24 30 20 19 24 24 27 28 27 26 18 29
 25 23 27 27 21 15 22 23 30 29 35 38 40 31 34 24 21 26 23 18
 17 19 19 22 28 24 24 26 36 36 28 40 32 40 28 23 32 22 24 22
 22 22 22 29 29 26 39 40 29 32 24 28 29 29 40 36 32 28 29 24
 24 24 20 28 31 40 38 38 34 40 38 27 39 24 18 22 22 24 16 26
 23 39 40 28 28 49 49 32 27 27 23 28 29 27 44 34 29 25 20 27
 27 29 23 23 39 32 49 40 28 25 28 39 28 28 23 39 32 32 40 28
 23 27 27 23 27 36 39 17 17 26 49 49 32 39 39 39 28 28 19 18
 21 14 13 18 11 18 27 25 23 17 23 25 27 23 24 17 23 23 23 21
 39 27 25 32 32 36 32 18 16 12 23 19 20 29 26 28 21 26 26 26
 27 29 23 16 12 25 18 23 24 24 18 23 27 23 28 27 23 13 15 13
 13 14 13 20 11 15 25 23 27 24 25 27 24 19 23 27 23 27 23 20
 18 27 27 27 23 10 15 14

clip:

TI      CLIP_LEFT       CLIP_RIGHT
713904976       112     445

I'd like to generate a fastq file consisting of the trimmed fasta sequences + their qual scores. (Or, generate the fastq file of the whole fasta records first, then trim it using the clip data as instructions. )

What tools can do that?

sanger trimming fastq • 1.7k views
ADD COMMENT
0
Entering edit mode
5.1 years ago
JC 13k

A simple script can do the trick:

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use convert.pl FASTA QUAL CLIP > FASTQ\n";

my $phred = 33; # Define Phred scale, it can be 33 or 64

my $seq_file  = shift @ARGV;
my $qual_file = shift @ARGV;
my $clip_file = shift @ARGV;

my ($id, $seq, $qual, $c_ini, $c_end);

# Read fasta sequence
open (my $fh, "<", $seq_file) or die "cannot read $seq_file\n";
while (<$fh>) {
    chomp;
    if (/>(.+)/) {
       $id = $1;
    }
    else {
       $seq .= $_;
    }
}
close $fh;

# Read quals
open (my $qh, "<", $qual_file) or die "cannot read $qual_file\n";
while (<$qh>) {
    chomp;
    next if (/>/);
    s/^\s+//; # remove spaces at the begin
    s/\s+$//; # remove spaces at the end
    my @values = split (/\s+/, $_);
    foreach my $v (@values) {
        $qual .= chr($v + $phred);
    }
}
close $qh;

# check is sizes are equal
die "Sequence and quality lenghts are not equal\n" if (length $seq != length $qual);

# Read clip coordinates
open (my $ch, "<", $clip_file) or die "cannot read $clip_file\n";
while (<$ch>) {
    chomp;
    next if (/^TI/); # skip header
    my @elem = split (/\s+/, $_);
    #verify if the id is the same for the sequence
    if ($id =~ /$elem[0]/) {
        $c_ini = $elem[1];
        $c_end = $elem[2];
        # clip sequences
        $seq  = substr($seq,  $c_ini - 1, $c_end - $c_ini);
        $qual = substr($qual, $c_ini - 1, $c_end - $c_ini);
    }
    else {
        warn "clip id doesn't match sequence id, printing sequence without modifications\n";
    }
    last; #only one line is allowed to modify the sequence
}
close $ch;

print "\@$id\n$seq\n+\n$qual\n";

Example:

$ perl convert.pl seq.fasta seq.qual seq.clip
@gnl|ti|713904976 1047095293492
GTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGA
+
7?9<9C:CG>@>8998AAIHH<46/777NI;MG==;IACEIEIH<686HHA=AI;A;:DCAII=EIGHHEA>====>NIEIC>A9338<AEPPGB:?9?5499<=<;3>:8<<6078?>DGI@C96;832447=99;EE=IAI=8A797777>>;HI>A9=>>IEA=>9995=@IGGCIG<H937791;8HI==RRA<<8=><MC>:5<<>88HARI=:=H==8HAAI=8<<8<EH22;RRAHHH==436/.3,3<:828:<8928886H<:AAEA31-845>;=6;;;<>81-:389938<8=<8.0../.5,0:8<9:<948<8<853<<<
ADD COMMENT
0
Entering edit mode

Thank you very much!!

ADD REPLY
0
Entering edit mode

Oops, wrote to soon. Have you tried your script with input files that contain more than a single fasta/clip/qual record? With a single record as input (as in your example) it works fine. With normal fasta /qual/clip files (which contain numerous records) it doesn't work it reports ' clip id doesn't match sequence id, printing sequence without modifications' and then concatenates all the fasta and qual text into giant globs.

e.g the input fasta here has two records, as do the qual and clip files

1-2.fasta
>gnl|ti|713904976 1047095293492
CCCCAGAAGCCCCGCGCTT...etc....GTGAATCC
>gnl|ti|713904977 1047095293493
TCGGGAGGCCAGGGGTTTT...etc....ATATCTC

1-2.qual
>gnl|ti|713904976 1047095293492
  9  9  9  8  6  9  6  7 10 10 10 10  9  9  7  9  7 10 10  9  etc...

>gnl|ti|713904977 1047095293493
  9  9 10 10  9  9  9  9 10 12  9 10 11 20 17  7 10 15  9 10 etc....

1-2.clip
TI      CLIP_LEFT       CLIP_RIGHT
713904976       112     445
713904977       108     805

   $ perl convert.pl  001_1-2.fasta 001_1-2.qual 001_1-2.clip
    clip id doesn't match sequence id, printing sequence without modifications
    @713904977 1047095293493
    CCCCAGAAGCCCCGCGCTTTTTCCATTTCACGACGTTGTAAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGAATCCTCGGGAGGCCAGGGGTTTTTCCAGTCACGACGTTGTAAACGACGGCCAGTGAATTGTAATACGACTCACTATAGGGCGAATTCGAGCTCGGTACCCGGGGATCCCACGTACGCTTCATCATCGCCCTTTTCAATTTCTGCATATTCCGATGGCTTGTATGAGACAAACGCATTCATTTTCCAGCTTTTATCCAAAGCAAGTTACTCAACTCCATATTATTCAAGTTTGATCCACTCCAGGGCCATCTTTTCTCGATGGACAACAGGGGACATCATTGATGGATCGGATGACGGAGGTGCTTCTGTTTCTGCAAAGAAAACTCTGCAAAAAGACGTAAATGAAACAGGAGGAATCTTCAGAGGAACAAGAAGTTACATGGACTTCCCATTTGCAAACTACAGAGATACTTCCAACTCGGTTTTGAACCTCGGACCAAGTGAATCCAAACTGAATTTCACTACTGCTCCAACAAATGCACCAGGAGGACTTTCTGGAGTTCACATATTCCTAATTGTACTTGCAGTGATTGTTGTAGTAGTTGTTTGCGTAGTAGTTGTATGCTTAGTTCACAGATCCAACCAATCAAAGGCGGAAATGAACAAAAAGTTCAACTCAGACAAAGGTGATGAACTATCTGATGTAGAGGACAAAGAAACTCAACAGACTGGGTACATGTCTGGAACTCCTTCTCCTCAAAGCTCTGCTGCAACTACTACTACTGGTCATAACTCTGCAACTTCTACTGTGAATAGCACGGGAGTTGTTCCGGCTGATCCAAACTTGAGGGTGTGGAATCCGTACGAGATTTCGCCGACTACAACAACTAATGAAAAATAGTTTTGTTTATTCATGATATTGTTTCGTTAGTTGCTATTGTTAATTGTTCTATTTATAAATATAAGTTCTGTTTGTTTTATTTTTCTTATCGATGCTTCCTTCCTAAATAGAAATGTTTTTTTTTAGATCATAATATGATCAAAATTAAATTATGATTGACATGCGCAATTGACAAATTTTTCTTGGAAACAGGATGTATTTGTAATTATATCTTTAAAGAGCCTGATGCATGATATCTC
    +
    ***)'*'(++++**(*(++**+******--+****-6-.25656=:9?6:6688@=:=88>8MBB>CC;>AAAAIIG>ND=8>>777;A>A>>>??BGB::D55755>><97?9<9C:CG>@>8998AAIHH<46/777NI;MG==;IACEIEIH<686HHA=AI;A;:DCAII=EIGHHEA>====>NIEIC>A9338<AEPPGB:?9?5499<=<;3>:8<<6078?>DGI@C96;832447=99;EE=IAI=8A797777>>;HI>A9=>>IEA=>9995=@IGGCIG<H937791;8HI==RRA<<8=><MC>:5<<>88HARI=:=H==8HAAI=8<<8<EH22;RRAHHH==436/.3,3<:828:<8928886H<:AAEA31-845>;=6;;;<>81-:389938<8=<8.0../.5,0:8<9:<948<8<853<<<8+0/**++****+-*+,52(+0*+*+**--112=EENNEIENENC>>>355333>;:689>8M>?>=CB<::B@DI>>C>9D:<798<<8EA9MB8>;BB?G:;;<:C=>97??B>I79968<:==8AAAEI:887<5>E?9=;9=>AAA=AHAHA8==HAENAH8<<:<9=8=<<<:5<<IIAIAAAAAINPAA:5999>:4GG@>>;8888CADCEHEII=;;59=EEDGG@CDIAIII=;=@>D>EGIRRRI=E=EEH9C9D=;==;<2:99DCCHEAE=IHH@>9;?<<<=::9;<9<AAHE=EAIIE>;9867=HNNIAAN35859=EA=A==ANEAD>>C<9229A9>C>IA=>H>8H:E<>;88;EI=9;9;;<AA>>CIPPPEHA8<==?;=INHA=@9??=CIIA=8II>>9>@>IEEHICAAHEE?=;;;9=IAI;?=;;9EIHIIINAIAAAHD=EA:><>C;=IHRAA8:><8AA<==9C>RNNGHIAIE>;9>>>HIIAAMIEMIAAI=NI=8=AEHEEH<<51>8<H69<<EHGNH=AH63;18IACIIII<><1<HIAEA=H?9>>>>IAIEII=IIREAA>668;<96;C<>>;<5696:583/<:9IIIMA886469==E=G;@;<53:.835>>8870<<<@;=AHAA=HIEMIEDD9:>CEA=II>DB>DCIEAC>975269C<CCCMMB@@>>=;5/+436BADMII>CCIMMPR@@?C<8HHA=IAIIACCCG>C;8<;;=IE><?7;;98AENNN@<>7799=79;2777?>IBB:==754-/*3.9/.+*+-7>ND>C?C=:EEC9??9=7;;;9=A9?C<CCD;B?87843;756;;;/0+*+*+,3//**-63>771.1**0,1/*+4/*,,3-+034-0**++230++1-2+*,**.**++*)-+*+(*-+,//+*1/533,++*..-*****+6-.+***-++++/1*-*+*+*+.**-+*******,+*,**+(++**+**..,*****+/+*********+**''*'**+*)***+*****++++++***'*****************'**
ADD REPLY
0
Entering edit mode

The script is intended to work on one sequence per file, to use with a multifasta, the data needs to be saved in memory in a Hash.

ADD REPLY

Login before adding your answer.

Traffic: 2656 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6