Question

Information from dbNSFP in oncotated MAFs

7.6 years ago

naxerova ▴ 20

Hi everyone,

I am processing oncotated MAFs downloaded from Firehose, and I am particularly interested in extracting dbNSFP information for each missense mutation.

However, I am pretty confused by the format. Here is a common example. For a missense mutation in ABLIM1, the i_dbNSFP_Ensembl_transcriptid column lists the following transcripts that overlap with the mutation:

ENST00000336585;ENST00000369252;ENST00000392952;ENST00000369257;ENST00000369267;ENST00000533213;ENST00000369262;ENST00000369263;ENST00000369266;ENST00000369256;ENST00000369260;ENST00000277895;ENST00000369253;ENST00000428430|ENST00000392955

When I now look at the columns that should show the functional impact for each transcript, I find the following, e.g. in the i_dbNSFP_Polyphen2_HVAR_score column:

0.861;0.603;0.279;0.893;0.887;0.999;0.992;0.873;0.595|.

There are 15 transcripts, but only 10 prediction entries (9 scores plus one "." for a missing value). I see no way of working out which prediction score belongs to which transcript, and of course a bunch are missing.

This is just one example of too many to count. Am I misunderstanding how this information is structured?

I would appreciate any help!!

Thanks so much. Kamila

Oncotator dbNSFP • 1.6k views

Updated 7.0 years ago by Matthew_Maher • written 7.6 years ago by naxerova ▴ 20

score 0 · Answer 1 · 2017-06-07

Here's my two cents, which may help (or may be wrong!):

dbNSFP attempts to combine lots of different data sources, which differ in cardinality and underlying key structure: some are unique to a position, some to a variant, some to a transcript, some to a protein, etc. The dbNSFP documentation does not (as far as I can tell) clearly explain which key each column follows when it holds a list of multiple values, as in the example you cite. But by analyzing the count correlations between columns (which I've done) and also reviewing the definition of the source data you're interested in (e.g. PolyPhen in your example), you can usually figure things out (though it would be nice if their doc page were more explicit).
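For anyone who wants to repeat that kind of count comparison, here is a rough Python sketch. It assumes each row is a dict of column name to raw cell text, that ";" separates entries, and that "|" separates groups of entries (that reading of "|" is my guess from the example in the question, not something I have seen documented). The function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def count_entries(cell):
    """Count the ';'-separated entries in a cell.

    Assumption: '|' separates groups of entries, so the total is the
    sum of the entry counts over all groups. '.' counts as an entry.
    """
    cell = cell.strip()
    if not cell:
        return 0
    return sum(len(group.split(";")) for group in cell.split("|"))


def length_agreement(rows, columns):
    """For every pair of columns, return the fraction of rows in which
    both columns hold the same number of entries."""
    pairs = list(combinations(columns, 2))
    matches = Counter()
    for row in rows:
        counts = {col: count_entries(row[col]) for col in columns}
        for a, b in pairs:
            if counts[a] == counts[b]:
                matches[(a, b)] += 1
    total = len(rows)
    if total == 0:
        return {}
    return {pair: matches[pair] / total for pair in pairs}


# The PolyPhen cell quoted in the question: 9 scores plus one '.'
print(count_entries("0.861;0.603;0.279;0.893;0.887;0.999;0.992;0.873;0.595|."))  # 10
```

Column pairs whose agreement is at or near 1.0 across many rows are good candidates for sharing the same key; pairs that rarely agree probably do not.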

You asked about PolyPhen. The PolyPhen-2 web site ( http://genetics.bwh.harvard.edu/pph2/ ) is driven by a protein ID (not a transcript ID!). If, in the dbNSFP rows in question, you look at the column called Uniprot_acc (i.e. protein IDs), I believe you will find a list of protein IDs whose length equals the length of the list of PolyPhen predictions.
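If that holds, pairing the two lists could look something like the sketch below. Again, this is only a sketch under assumptions: ";" separates entries, "|" separates groups, "." means no prediction, and the function names and accession strings (ACC1 etc.) are placeholders I made up, not real values from the MAF. It returns None whenever the lengths disagree, so a wrong guess about the key shows up instead of silently mispairing scores.

```python
def split_groups(cell):
    """Split a cell into '|'-separated groups of ';'-separated entries."""
    return [[value.strip() for value in group.split(";")]
            for group in cell.split("|")]


def pair_by_protein(uniprot_cell, score_cell):
    """Pair each protein accession with its score, position by position.

    Returns a list of (accession, score) tuples, with None as the score
    where the cell holds '.', or None if the two cells do not line up.
    """
    acc_groups = split_groups(uniprot_cell)
    score_groups = split_groups(score_cell)
    if len(acc_groups) != len(score_groups):
        return None
    pairs = []
    for accs, scores in zip(acc_groups, score_groups):
        if len(accs) != len(scores):
            return None  # lists do not line up; do not guess
        for acc, score in zip(accs, scores):
            pairs.append((acc, None if score == "." else float(score)))
    return pairs


# Placeholder accessions, with scores shaped like those in the question
print(pair_by_protein("ACC1;ACC2|ACC3", "0.861;0.603|."))
# [('ACC1', 0.861), ('ACC2', 0.603), ('ACC3', None)]
```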

I hope that helps (and is correct).