COSMIC Mutation ID
8.9 years ago
byoo • 0

I have some questions about COSMIC mutation IDs in the GRCh37 version of COSMIC v72.

Is there a reliable way to validate a COSMIC mutation ID using only the downloadable data at http://grch37-cancer.sanger.ac.uk/cosmic/download ?

I could send a request to the COSMIC website to check each ID, but I would like to avoid that if possible.

COSMIC mutation IDs in the downloadable data are not always searchable on the COSMIC website, and the data seem inconsistent.

Looking at CosmicCompleteExport.tsv.gz and VCF/CosmicCodingMuts.vcf.gz, I am not sure how to interpret the following:

  1. Some COSM IDs in VCF/CosmicCodingMuts.vcf.gz are not found in CosmicCompleteExport.tsv.gz.
  2. Some COSM IDs found in both VCF/CosmicCodingMuts.vcf.gz and CosmicCompleteExport.tsv.gz are not found on the website.

    • Example: COSM330384 is found in both files but not on the COSMIC website: http://grch37-cancer.sanger.ac.uk/cosmic/mutation/overview?id=330384
      $ zcat cosmic/grch37/cosmic/v72/CosmicMutantExport.tsv.gz | grep -P "COSM330384\t"
      SLC4A11_ENST00000380059 ENST00000380059 2757            SCC-9   2296303 2161906 upper_aerodigestive_tract       head_neck       carcinoma       squamous_cell_carcinoma     y       COSM330384      c.77C>G p.P26R  Substitution - Missense         37      20:3218634-3218634      -       y       PASSENGER/OTHER Reported in another cancer sample as somatic        25275298                cell-line       NS      25
      # ...(many records more)
      $ zcat cosmic/grch37/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz | grep -P "COSM330384\t"
      20      3218634 COSM330384      G       C       .       .       GENE=SLC4A11_ENST00000380059;STRAND=-;SNP;GENE=SLC4A11_ENST00000380059;STRAND=-;CDS=c.77C>G;AA=p.P26R;CNT=10
      
  3. Some variants have multiple IDs assigned:

    • Example:
      $ zcat cosmic/grch37/cosmic/v72/VCF/CosmicCodingMuts.vcf.gz | grep -P "108175462\t"
      11      108175462       COSM3736031     G       A       .       .       GENE=ATM_ENST00000278616;STRAND=+;SNP;GENE=ATM_ENST00000278616;STRAND=+;CDS=c.5557G>A;AA=p.D1853N;CNT=2
      11      108175462       COSM41596       G       A       .       .       GENE=ATM;STRAND=+;SNP;GENE=ATM;STRAND=+;CDS=c.5557G>A;AA=p.D1853N;CNT=12
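
For observation 1, the overlap between the two files can be measured directly instead of grepping one ID at a time. The following is a minimal sketch, not a COSMIC-provided tool: the function names are hypothetical, and the export column header `Mutation ID` is an assumption that should be checked against the header line of your download.

```python
import csv
import gzip


def vcf_ids(path):
    """Collect the ID column (third field) of every non-header VCF record."""
    ids = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 2:
                ids.add(fields[2])
    return ids


def export_ids(path, id_column="Mutation ID"):
    """Collect mutation IDs from the tab-separated export.

    The header name "Mutation ID" is an assumption; adjust it to your file.
    """
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row[id_column] for row in reader if row.get(id_column)}


def compare_ids(vcf_path, export_path):
    """Return the IDs unique to each file and the IDs shared by both."""
    in_vcf = vcf_ids(vcf_path)
    in_export = export_ids(export_path)
    return {
        "vcf_only": in_vcf - in_export,
        "export_only": in_export - in_vcf,
        "both": in_vcf & in_export,
    }
```

Calling `compare_ids("VCF/CosmicCodingMuts.vcf.gz", "CosmicMutantExport.tsv.gz")` would give the size of each group. It cannot say which IDs the website will resolve.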
      

I would appreciate any advice.

8.9 years ago

Hi Byoo,

Unfortunately I have no solution to your question, just a couple of comments.

For the second case, on why some COSMIC IDs are not present in the front end: this has been happening since version 70 of COSMIC. Their change log contains the following text:

Data Filtering

To improve the value of COSMIC data we have tried to identify the most significant high-value data within cancer genomes using the following filtering strategies -

Mutations

We have excluded data from any sample with over 15,000 mutations. In addition, we have flagged all known SNPs as defined by the 1000 genomes project, dbSNP and a panel of 378 normal (non-cancer) samples from Sanger CGP sequencing. Using this approach 812,136 mutations have been flagged. Although all data are included in our download files, we have excluded flagged mutations from the website.

As you can see from the quoted text, they are doing some flagging and filtering on the website. The problem is that these flags don't seem to be available to users who want to apply the same filtering in their own analysis. I asked the COSMIC team how to get access to these flags or to the filter algorithm but have not received an answer yet. I will post back here as soon as I hear from them.
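
One thing that may be worth checking: the INFO column of the COSM330384 and ATM records quoted in the question contains a bare `SNP` token, while the RHBG records quoted later in this thread do not. Whether that token is the same flag the website filter uses is an unconfirmed assumption. The sketch below only shows how to test a VCF line for the token; the function names are hypothetical.

```python
def parse_info(info):
    """Split a VCF INFO string into key=value pairs and bare flags."""
    values, flags = {}, set()
    for item in info.split(";"):
        if not item:
            continue
        if "=" in item:
            key, value = item.split("=", 1)
            values[key] = value  # a repeated key keeps its last value
        else:
            flags.add(item)
    return values, flags


def has_snp_token(vcf_line):
    """Return True if the INFO column (8th field) carries a bare SNP token.

    Treating this token as the website's "known SNP" flag is an assumption.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False
    _, flags = parse_info(fields[7])
    return "SNP" in flags
```

If the assumption holds, IDs whose records carry the token would be the ones missing from the website. That could be checked against a handful of IDs by hand before relying on it.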

For your third case, the problem comes from the gene models. I have seen this multiple times. As your example shows, the two entries correspond to two "different" genes. In your case the two models look exactly the same, apart from one of them having the Ensembl transcript identifier attached to the gene name. I have also seen cases where one of the gene models corresponds to an alternatively spliced version of the gene, so while the genomic coordinates are the same, the amino acid position of the change may differ.
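
To see how often this happens, the VCF records can be grouped by genomic change and the gene label attached to each ID listed. This is a rough sketch with a hypothetical function name. It takes the first `GENE=` entry of the INFO column as the gene model label, which matches the records quoted in the question but is otherwise an assumption.

```python
from collections import defaultdict


def ids_by_variant(vcf_lines):
    """Group (COSM ID, gene label) pairs by (chrom, pos, ref, alt).

    Only variants that carry more than one ID are returned.
    """
    groups = defaultdict(list)
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, pos, cosm_id, ref, alt = fields[:5]
        gene = None
        for item in fields[7].split(";"):
            if item.startswith("GENE="):
                gene = item[len("GENE="):]
                break  # first GENE= entry only (assumption)
        groups[(chrom, pos, ref, alt)].append((cosm_id, gene))
    return {key: ids for key, ids in groups.items() if len(ids) > 1}
```

Any iterable of lines works as input, for example `gzip.open("VCF/CosmicCodingMuts.vcf.gz", "rt")`. On the ATM records from the question, the two IDs would appear under one key with the labels `ATM_ENST00000278616` and `ATM`.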

I am sorry I cannot be of more help but hopefully you find this information a bit helpful. I am trying to solve exactly the same problem so will keep you posted if I make any progress.

Cheers,
J.


Thank you, Julio. It did help me a lot. I did a little more investigation. COSMIC sometimes assigns different COSM IDs to different samples for the same variant on the same gene model.

For example:

  • CosmicCompleteExport

    • RHBG_ENST00000368246 ENST00000368246 1374 WSU-HN30 2296299 2161902 upper_aerodigestive_tract head_neck carcinoma squamous_cell_carcinoma y COSM4600365 c.1263+2delC p.? Unknown 37 1:156354348-156354348 + Variant of unknown origin 25275298 cell-line NS TNM:T3N0M0
    • RHBG_ENST00000368246 ENST00000368246 1374 WSU-HN12 2296297 2161900 upper_aerodigestive_tract head_neck carcinoma squamous_cell_carcinoma y COSM4600988 c.1263+2delC p.? Unknown 37 1:156354348-156354348 + Variant of unknown origin 25275298 cell-line NS TNM:T4N1M0
    • ...
  • VCF

    • 1 156354347 COSM4600988 TC T . . GENE=RHBG_ENST00000368246;STRAND=+;GENE=RHBG_ENST00000368246;STRAND=+;CDS=c.1263+2delC;AA=p.?;CNT=1
    • 1 156354347 COSM4600365 TC T . . GENE=RHBG_ENST00000368246;STRAND=+;GENE=RHBG_ENST00000368246;STRAND=+;CDS=c.1263+2delC;AA=p.?;CNT=1
    • 1 156354347 COSM4599196 TC T . . GENE=RHBG_ENST00000368246;STRAND=+;GENE=RHBG_ENST00000368246;STRAND=+;CDS=c.1263+2delC;AA=p.?;CNT=1
    • 1 156354347 COSM4599664 TC T . . GENE=RHBG_ENST00000368246;STRAND=+;GENE=RHBG_ENST00000368246;STRAND=+;CDS=c.1263+2delC;AA=p.?;CNT=1
    • ...
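
A check like this can be run across the whole export by grouping rows on gene model, CDS change and genome position, then listing which samples carry which ID. The sketch below is only illustrative. The function name is hypothetical, and the column headers (`Gene name`, `Mutation CDS`, `Mutation genome position`, `Mutation ID`, `Sample name`) are assumptions to verify against the header line of the export file.

```python
from collections import defaultdict


def ids_per_gene_model(
    rows,
    gene_col="Gene name",
    cds_col="Mutation CDS",
    pos_col="Mutation genome position",
    id_col="Mutation ID",
    sample_col="Sample name",
):
    """Map (gene model, CDS change, position) to {COSM ID: [sample names]}.

    Only keys with more than one distinct COSM ID are returned.
    Column names are assumptions; pass your own if the header differs.
    """
    groups = defaultdict(lambda: defaultdict(set))
    for row in rows:
        key = (row[gene_col], row[cds_col], row[pos_col])
        groups[key][row[id_col]].add(row[sample_col])
    return {
        key: {cosm_id: sorted(samples) for cosm_id, samples in ids.items()}
        for key, ids in groups.items()
        if len(ids) > 1
    }
```

The `rows` argument can be any iterable of dictionaries, such as `csv.DictReader(gzip.open(path, "rt"), delimiter="\t")`.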
