How To Control Wgsim Base Qualities?
3
2
Entering edit mode
12.4 years ago
Pascal ★ 1.5k

Hi

I'm trying to make GATK UnifiedGenotyper to detect indels inserted by wgsim short reads simulator.

Broad employees just informed me that indels are not called because wgsim creates reads of Q17 and everything below 20 is thrown away.

I checked the fastq files, and indeed, all qualities are "2" which is if I understood well the ASCII translation corresponds to 17.

Is there a way to make wgsim output reads with a configurable base quality?

I thought that a trick could be to replace these "2222...2" strings by somthing like "????...?" Or is there a more elegant way to proceed?

Thanks.

fastq gatk • 5.5k views
ADD COMMENT
4
Entering edit mode
12.4 years ago

You could use the following awk script to change your 2s to Is:

awk ' /222222222/ { gsub("2", "I"); print $0; next } { print } '
ADD COMMENT
0
Entering edit mode

Double thanks! I will try both.

ADD REPLY
0
Entering edit mode

Double thanks Sir! I will try both solutions...

ADD REPLY
0
Entering edit mode

Wait a minute: I actually used the wgsim available at https://github.com/lh3/wgsim and this produces only "2" qualities ?!

ADD REPLY
0
Entering edit mode

my mistake - I have a file like that I thought I generated it with wgsim - you are correct it does put in 2s. I will correct the answer

ADD REPLY
0
Entering edit mode

No problem. Thanks for answering. I just awked my fastq files with your file and it is doing alignment now :->... I think I will move to MAQ as it looks it offers more control. Cheers.

ADD REPLY
2
Entering edit mode
11.0 years ago

I forked wgsim and made the quality score configurable using the -Q command line argument. I've made I the default value for the quality score (Phred quality score of 40). Feel free to use my fork at https://github.com/hammer/wgsim to alleviate the need for the post-processing step with awk.

Previously the quality score was scaled with the error rate, so that as the error rate went up, the quality score went down. The precise formula used is (-10.0 * log(ERR_RATE) / log(10.0) + 0.499) + 33 (see https://github.com/lh3/wgsim/blob/master/wgsim.c#L248). Because ERR_RATE is set to 0.02 by default, the quality score comes out to (-10.0 * log(0.02) / log(10.0) + 0.499) + 33 = 50.4887..., which is encoded using the character 2, which has the ASCII character 50.

ADD COMMENT
0
Entering edit mode
11 months ago
André • 0

To make what Jeff said easier, this table gives you error probabilities for each quality score. If you set the -e flag in wgsim to the value in the P_error column you will get the respective base quality. If you want a given error rate with a different quality then you are better off replacing the base qualities like Istvan said.

ADD COMMENT

Login before adding your answer.

Traffic: 2278 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6