Question

Converting Solexa Reads To Fastq And Identifying Mismatched Bases

0

Entering edit mode

10.9 years ago

soosus ▴ 90

I would like to write a simple script or single line perl to convert the following two Solexa sequence reads (assume these reads in a text file "Solexa.export.txt") into fastq format. (Note: the quality scores are coded as ASCII offset of 64):

ILLUMINA-A93428 66      1       1       7848    1267    CGATGT  1       ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG   ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB chrX.fa         122287709       R       40C21GAC7       269                Y

ILLUMINA-A93428 66      1       1       8782    1281    CGATGT  1       GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG   hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe`] chrX.fa         144709339       F       72      335                        Y

Also need help identifying mismatched bases in "Solexa.export.txt"

r fastq • 3.1k views

ADD COMMENT • link updated 10.9 years ago by Mitch Bekritsky ★ 1.3k • written 10.9 years ago by soosus ▴ 90

score 0 · Answer 1 · 2013-05-17

0

Entering edit mode

10.9 years ago

Sean Davis 26k

You might try this script to convert to SAM/BAM. Then, you can use tools like picard Sam2Fastq and samtools calmd to get the information you need.

https://github.com/samtools/samtools/blob/master/misc/export2sam.pl

ADD COMMENT • link 10.9 years ago by Sean Davis 26k

score 0 · Answer 2 · 2013-05-17

Try this:

awk '/./ {print "@"$1"\n"$9"\n+\n"$10}' Solexa.export.txt

The '/./' in the awk command skips any blank lines.

When I run this one-liner on your sample script, it produces the following output:

@ILLUMINA-A93428
ATGTTGGCTGGCGGTGAAATAAATCTCAAACGTACCTGTTACAAAATTGTGTTAATCCTTTCAGATTCGCAG
+
ggggggggggggggcfdffdgggfefffffcfcffcccacBeddddded_cffefgggggBBBBBBBBBBBB
@ILLUMINA-A93428
GACTCCGGGAAAGCAAGCTGAAGTGTTTTGTGGGACAGGCACATCATTTGGTCAACTCCTTTTCCCTGGTTG
+
hhhhhhhhhhhhhhhhhhhhghfehhhhhhghfhf_ffffghhhhhhhhdaaghhghhhhgfeghfghhe`]

When you say that the encoding is offset by 64, does that mean you want to change it, or is that just an FYI?

Also, what do you mean that you want to identify mismatched bases? Do you just want to find reads that don't match perfectly? In that case, try this:

awk '$14 !~ /^[0-9]+$/ && /./' Solexa.export.txt

I'm assuming that the 14th field in Solexa.export.txt reports the alignment of the read to the reference genome. In that case, any entry where field 14 isn't only digits would imply that there are mismatched bases. In your second sequence, field 14 reports 72, the length of the read. In the first sequence, it reports 40C21GAC7, implying 40 matched bases, then a mismatched C, etc.