Quality scores in ICGC simple somatic mutation file
5.2 years ago
tralynca • 40
South Africa

Hi all,

I've recently downloaded the simple somatic mutation (SSM) file for clear cell renal cell carcinoma (ccRCC) from the ICGC Data Repository, but I've been having some trouble interpreting the quality score column.

Below is a snippet of my data (.tsv file):

chromosome chromosome_start chromosome_end chromosome_strand mutation_type reference_genome_allele mutated_from_allele mutated_to_allele quality_score probability total_read_count

1 224822287 224822287 1 single base substitution T T G 223 46 26
1 224822287 224822287 1 single base substitution T T G 223 46 26

However, I'm not sure why the quality scores are so high: every entry has a quality score between 100 and 223. Some say that Phred scores can in fact range from 0 to infinity (http://gatkforums.broadinstitute.org/discussion/4260/how-should-i-interpret-phred-scaled-quality-scores), while others say that scores in the 200 range probably mean that the signal was too low (http://seqanswers.com/forums/showthread.php?t=23770).
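For reference, a Phred-scaled score Q corresponds to an error probability P through Q = -10 * log10(P). The following is a minimal sketch of that conversion, assuming the ICGC `quality_score` column is Phred-scaled (this thread does not confirm that); the function names are hypothetical.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Probability that a call is wrong, given a Phred-scaled quality Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability P (0 < P <= 1)."""
    return -10 * math.log10(p)


# Assumption: the scores below are Phred-scaled; 100 and 223 are the
# extremes reported in the post, 20 and 30 are common reference points.
for q in (20, 30, 100, 223):
    print(f"Q={q}: P(error) = {phred_to_error_probability(q):.3g}")
```

Under that assumption, a score of 223 would correspond to an error probability far below anything that could be measured empirically, so it is best read as the caller's model confidence rather than as a literal error rate.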

The ICGC documentation describes the quality score column as the quality of the mutation call, not of the alignment etc. (http://docs.icgc.org/simple-somatic-mutations-ssm-primary-analysis-file-p).

The remaining columns indicate that samtools pileup was used for the raw variant calls, alongside other analysis tools such as GATK, Picard, VCFtools, etc. No verification on an orthogonal platform or biological validation was carried out for any of the calls.

Can anyone confirm whether this does in fact imply high quality, or whether I should be looking out for something else?

Thanks in advance,

Tracey

16 months ago
Ying W ♦ 3.9k
South San Francisco, CA

Could you link to the ccRCC SSM file? I looked through the SSM file here: https://dcc.icgc.org/repository/release_18/Projects/RECA-CN and it looks like they used VarScan, but it doesn't show the quality scores that you pasted. The quality scores generated by VarScan are described here: http://varscan.sourceforge.net/somatic-calling.html#somatic-output. Note that there was a conversion process from VarScan output to VCF.


Thank you for your response, Ying, but I used the EU/FR data set since they carried out whole genome sequencing (https://dcc.icgc.org/repository/current/Projects/RECA-EU). They used samtools mpileup for variant calling. Thank you for going to the trouble of pasting the link to the VarScan documentation.

If you also have some experience with samtools, I would be happy to hear your thoughts on the quality scores.


Tbh, I'm not very sure how samtools pileup/mpileup outputs quality values, or which one is being used for the SSM file. There are multiple posts on this website asking about samtools/pileup/mpileup and quality values. To go back to your original question, I would assume that the high quality values mean the calls are good enough for your purposes, since they are being distributed; the lower-quality variants were probably filtered out. If you don't trust them, you would have to get the raw data and do the variant calling yourself (which requires authorization, since tumor/normal BAM files are protected patient data). I was under the impression that the data on the ICGC website will _eventually_ include normalized variant calls produced with the same pipeline.
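One way to probe the guess that low-quality variants were filtered out is to look at the range of `quality_score` across the whole file: a hard lower bound would be consistent with a filtering threshold, though it would not prove one. A rough sketch, assuming a tab-separated file with a header row; the function name is hypothetical and the second sample row is made up for illustration.

```python
import csv
import io


def summarize_quality(tsv_text: str, column: str = "quality_score"):
    """Return count, min and max of a numeric column in tab-separated text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows where the column is missing or empty.
    scores = [float(row[column]) for row in reader if row.get(column)]
    if not scores:
        return None
    return {"n": len(scores), "min": min(scores), "max": max(scores)}


# Illustrative input only: the first row reuses values from the post,
# the second row is invented to show a lower score.
sample = (
    "chromosome\tchromosome_start\tquality_score\n"
    "1\t224822287\t223\n"
    "1\t224822290\t100\n"
)
print(summarize_quality(sample))
```

For a real SSM download, the file contents would be read from disk and passed in as text; very large files would be better streamed row by row than loaded whole.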


Hi Ying,

I've gone through the questions about samtools/mpileup but none of them seem to address the issue of the quality score. I did write to ICGC about two weeks ago and again today. I'm awaiting a response. Thank you.
