Question

dNdScv error: Zero coding substitutions found in this dataset

0

Entering edit mode

5.2 years ago

zizigolu ★ 4.3k

Hi,

I have a set of whole genome sequencing data, I want to find driver mutations so I am trying to use dndscv tool. But, I am permanently getting this error

> dndsout = dndscv(mutation)
[1] Loading the environment...
[2] Annotating the mutations...
Error in dndscv(r) : 
  Zero coding substitutions found in this dataset. Unable to run dndscv.
In addition: Warning messages:
1: In dndscv(r) :
  Mutations observed in contiguous sites within a sample. Please annotate or remove dinucleotide or complex substitutions for best results.
2: In dndscv(r) :
  Same mutations observed in different sampleIDs. Please verify that these are independent events and remove duplicates otherwise.
3: In .Seqinfo.mergexy(x, y) :
  The 2 combined objects have no sequence levels in common. (Use
  suppressWarnings() to suppress this warning.)
>

This is the header and tail of my data

> head(mutation)
  sampleID  chr      pos ref  mut
1 Sample_1 chr1  7914928   A ATAC
2 Sample_1 chr1 15080062   A   AG
3 Sample_1 chr1 16714609   C  CAT
4 Sample_1 chr1 19078857  TA    T
5 Sample_1 chr1 22621124  TA    T
6 Sample_1 chr1 34556929  TA    T
> tail(mutation)
      sampleID    chr      pos ref mut
85912 Sample_4 hs37d5 33864042   A   G
85913 Sample_4 hs37d5 34253273   C   A
85914 Sample_4 hs37d5 35186366   A   C
85915 Sample_4 hs37d5 35222851   A   G
85916 Sample_4 hs37d5 35232459   A   C
85917 Sample_4 hs37d5 35295511   A   G
>

Anyone knows how I can solve this error?

dndscv R WGS • 4.4k views

ADD COMMENT • link updated 5.2 years ago by Inigo Martincorena ▴ 220 • written 5.2 years ago by zizigolu ★ 4.3k

3

Entering edit mode

5.2 years ago

Ram 43k

Zero coding substitutions found

There you go - it’s a clear error message.

ADD COMMENT • link 5.2 years ago by Ram 43k

0

Entering edit mode

Sorry could you please help me to see what does this say?

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

2

Entering edit mode

Try and answer the following questions yourself:

What does the tool do?
What sort of mutations does it absolutely need to perform its job?
What do you think the error message says?

ADD REPLY • link 5.2 years ago by Ram 43k

0

Entering edit mode

5.2 years ago

zx8754 11k

This tool only works with SNV, meaning in your input dataframe the columns ref and alt can only have below letters:

c("A","C","G","T") - see #L116

And the error is triggered at this step:

if (!any(snv)) { stop("Zero coding substitutions found in this dataset. Unable to run dndscv.") } - see #L175

ADD COMMENT • link 5.2 years ago by zx8754 11k

0

Entering edit mode

Thank but says this tool also can handle the Indels :-(

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

Entering edit mode

An Indel is a Short Nucleotide Variant you are mixing it up with a SNP. Look at the code he provided and Rams answer.

mutations$strand = sapply(RefCDS,function(x) x$strand)[mutations$geneind]
snv = (mutations$ref %in% nt & mutations$mut %in% nt)
if (!any(snv)) { stop("Zero coding substitutions found in this dataset. Unable to run dndscv.") }

The keyword is RefCDS. You have no SNVs in annotated coding regions. Your data is the problem. Check your data why there is no overlap with annotated coding regions.

ADD REPLY • link 5.2 years ago by ATpoint 81k

Ram · Accepted Answer · 2019-02-12

3

Entering edit mode

5.2 years ago

Inigo Martincorena ▴ 220

Thanks for trying the dNdScv package and the feedback. I suspect that the problem is that your chromosome names are different to those used by dNdScv. Your chromosome names contain the prefix "chr", while the reference object used by dNdScv for GRCh37/hg19 does not use this. As a result, the method does not find any valid coding mutation in your data. Indels are handled by dNdScv, so they should not be a problem.

Try this:

mutation$chr = gsub("chr","",as.vector(mutation$chr))
dndsout = dndscv(mutation)

I will try to improve the error message generated when the chromosome names do not match to avoid this confusion in the future.

ADD COMMENT • link 5.2 years ago by Inigo Martincorena ▴ 220

1

Entering edit mode

Thank you for your time assigning to my problem, without asking for your help for sure for days I had to wandering in google but I know I would not find the solution myself

My data not only contained chr prefix but also something like hs37d5 and chrM and so on, so by your kindly comment I removed all then code worked. However, I am not sure if it is right to ignoring these contigs.

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

I think it is fine to dismiss those contigs to study selection and to search for cancer genes. However, if you are very interested in including these contigs in the analysis, you can build your own RefCDS object using the buildref function in the package (see a tutorial below). But I would not worry about this unless you have a particular interest in these contigs. http://htmlpreview.github.io/?http://github.com/im3sanger/dndscv/blob/master/vignettes/buildref.html

The mitochondrial genome (chrM) is a special one and cannot be analysed with the rest of the genome. The reasons are that the genetic code is different (which matters when running dN/dS) and that the mitochondrial genome has very different mutation rates and spectrum that need to be modelled separately. For cancer gene discovery it is reasonable to exclude the mitochondrial genome, however, if you are interested in studying selection in the mitochondria you can again use the buildref function to build an object for it. The mitochondria has an additional challenge in that there is a very strong replication strand asymmetry. So, if you were interested in mitochondria, I would recommend running dNdScv separately for the MT-ND6 gene (see paper below, but note that the dN/dS model that we used in this paper was developed before dNdScv). https://elifesciences.org/articles/02935

But I think that almost all users can rely on the default object from dNdScv and not worry about the additional contigs or about the mitochondrial genome.

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Sorry,

For getting better results, do you suggest me to use original vcf files as input for your software or I should first filtrate vcf file for minor allele frequency and variant with deleterious impact of the protein?

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

I would need to know more about your project to give you a good answer. It all depends on how clean your table of somatic mutations is. Many false positives and artefacts tend to have low allele frequencies, and filtering them out may improve the results. However, if your dataset is clean, you don't need to do that. You should also never filter your mutations based on their impact to the protein before running dNdScv, because the method expects an unbiased dataset (e.g. dN/dS needs synonymous mutations to model the mutation rates).

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Thanks a lot, very very appreciated as I was struggling especially about the filtration based on influencing proteins.

My project is responding or non-responding to chemotherapy in oesophagus adenocarcinoma; So, we have whole genome sequencing of responder patients (tumour vs normal tissue) and matched non-responder patients to chemotherapy. Being responders or non-responders has been assessed by mandard scoring (TRG). However, as different patients carrying different and heterogeneous tumours, I am not sure if finding the driver genes could help in answering what is the difference in the genomes of responders versus non-responders.

The authors of

The landscape of selection in 551 esophageal adenocarcinomas defines genomic biomarkers for the clinic highly recommended me to use dNdScv as their best performing tool, so I hope I could find a signature in responders and non-responders to chemotherapy by finding significant driver genes. Please kindly give me a clue if I am in wrong path to answer my question.

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

You can certainly try dNdScv, running it separately on each set of patients and on both sets together. This will give you lists of significant genes and some measure of selection. However, dNdScv is not designed to look for differences between two cohorts, so you may want to consider alternative approaches (e.g. statistically compare the fraction of patients with a driver mutation in a given gene in both sets of patients). I would also advice you to contact the authors of that paper for advice, since they will be more familiar with the type of cancer that you are analysing.

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Sorry,

I likely analyzed my data with dNdScv successfully; However, I am not sure how I could achieve three ouputs

1- For getting the percentage of driver mutations in each gene and in different mutation classes I am not sure which table gives me these. For example

> print(head(sel_cv), digits = 3)
      gene_name n_syn n_mis n_non n_spl wmis_cv wnon_cv wspl_cv
18057      TP53     0    14     2     1   247.5     622     622
3794      CHST2     0     0     2     0     0.0    1006    1006
16660     SPTA1     0     9     0     0    20.3       0       0
19410   ZDHHC17     0     0     0     2     0.0     268     268
9327   KRTAP2-1     0     0     1     0     0.0   11304   11304
19877     ZNF66     0     3     0     0    36.8       0       0
       pmis_cv ptrunc_cv pallsubs_cv qmis_cv qtrunc_cv qallsubs_cv
18057 0.00e+00  5.04e-08    0.00e+00  0.0000   0.00101      0.0000
3794  7.32e-01  2.40e-06    1.32e-05  0.9353   0.02407      0.0905
16660 2.75e-06  7.46e-01    1.35e-05  0.0276   0.99405      0.0905
19410 7.26e-01  3.78e-05    1.83e-04  0.9353   0.25348      0.9211
9327  9.51e-01  5.48e-05    2.92e-04  0.9616   0.27548      0.9999
19877 2.96e-04  9.18e-01    1.41e-03  0.9353   0.99405      0.9999
> print(dndsout$globaldnds)
     name       mle     cilow   cihigh
wmis wmis 0.9989565 0.9198342 1.084885
wnon wnon 1.0253553 0.8406774 1.250603
wspl wspl 1.1058024 0.8414807 1.453152
wtru wtru 1.0509338 0.8894579 1.241725
wall wall 1.0025578 0.9241182 1.087655

If I am not wrong for TP53 the class of mutation is missence?

2- Output of the mean number of exonic driver mutations per case

3- Passenger mutation rates

I am sorry to be this much disturbing

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Hello, The outputs that you are asking for are not standard outputs from dndscv. They can be obtained with some additional calculations based on the outputs of dNdScv and following the methods of my paper in Cell. However, the risk of these measures being misused or misinterpreted by users without a good understanding of dNdScv was so high, that I decided not to make them part of the standard output.

Answering your questions specifically:

Confidence intervals for dN/dS values per gene are not available and they would ideally be needed for the question that you ask. I will consider adding them to future versions of dNdScv. In the meantime, note that the columns wmis_cv and wnon_cv are the maximum likelihood estimates (point estimates) of the dN/dS values per gene. For example, for TP53 dN/dS ratios for missense and truncating (nonsense + essential splice site) mutations are 247.5 and 622, respectively. dN/dS ratios measure the excess of mutations observed over those expected by chance under neutrality. So, you can calculate the excess of missense mutations in TP53 as: (247.5-1)/247.5*14 = 13.94. That is, the rate of passenger missense mutations expected in TP53 under neutrality was very low (<<1) and most likely all 14 missense mutations seeing in TP53 are drivers. As I said, ideally one would use confidence intervals for the dN/dS values to add uncertainty to these calculations. Also, note that in the absence of confidence intervals, I would only calculate these values for genes with q-value<0.05 (in this case only TP53).
The global estimate for the number of driver mutations per exome can be obtained using dndsout$globaldnds with the code below. Where nmuts is the number of mutations of each class (missense and nonsense+essential_splice) and nsamples is the number of samples (or exomes) studied. Note, however, that a number of assumptions are used in these calculations, so I would not use any of this unless you are confident that you understand the methodology and the principles of dNdScv and the suitability of your data.
```
dnds = dndsout$globaldnds[c(1,4),2:4] # Global dN/dS estimates with CI95% for missense and truncating subs 
dnds[dnds<1] = 1 # Bounded by 1 (since we are estimating excesses) 
drivers_per_sample = (dnds-1)/dnds[,1]*nmuts/nsamples
```
Not sure what you mean by "passenger mutation rates". But the tutorial contains information that may be useful in this direction.

If you are trying to recapitulate the analyses of a given paper, like the one you mentioned before on esophageal adenocarcinomas, I would advice you to directly contact the authors for advice (and code) to reproduce their analyses.

I hope it helps. -Inigo

ADD REPLY • link updated 5.2 years ago by Ram 43k • written 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Thank you very much to be this much patient and helpful. Actually I have my own independent data with different question than the paper mentioned initially. The authors just suggested your software as their best performing tool for finding genes under positive selection. However I will need a lot of time to grab what you kindly described here. Sorry but now,

By q-value only a few genes will remain for me (one or two gene including TP53), could I instead use p-value for reporting a list of possible driver genes ?

Thank you again

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Not really, when calculating p-values for 20,000 genes, you expect many (~1,000) to have p-values<0.05 by chance. q-values are corrected for multiple testing to avoid this and so q-values should always be used. You can use a less conservative q-value cutoff (e.g. 0.10) but I would be careful as you can start to include false positives in your list of significant genes.

There is a better way of increasing power to detect significance in known cancer genes, and that is running dNdScv on a list of known cancer genes. Or else, perform FDR correction (p.adjust function) on a list of cancer genes using the output from dNdScv (which is a better way of doing it as you use all genes to run dNdScv). In any case, this will only increase the power on already known cancer genes and needs to be reported as Restricted Hypothesis Testing (see example)

library("dndscv")
data("cancergenes_cgc81", package="dndscv") # Genes in the Cancer Gene Census (v81)
dndsout = dndscv(mutations, gene_list=known_cancergenes)

ADD REPLY • link updated 5.2 years ago by Ram 43k • written 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Sorry @ Inigo Martincorena

By my too little information, a gene is being considered as driver if be mutated in a wide range of samples; For instance in mentioned paper they are ignoring significant driver genes carrying less than 7 mutations. In my data even for a gene with significant overall q-value I have 2 or maximum three samples having this mutation; So, in your opinion please, could I still conclude that any of the genes in my results is a driver with this information?

Thank you so much anyway

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

0

Entering edit mode

Can you show the output table (sel_cv) for this gene (feel free to remove the gene name if you want to keep it anonymous)? It is hard to tell without seeing the data, since there can be artefacts affecting the reliability of some results.

But, yes, dNdScv can sometimes successfully identify recurrently mutated genes with very few mutations. For example, considering the number of samples, the size of a gene, its sequence and the trinucleotide mutation rates in a dataset, the model may estimate that only 0.001 mutations are expected in a given gene by chance. If you saw 3 mutations in that gene, its significance could be very high.

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Thank you very much

> print(head(sel_cv), digits = 3)
      gene_name n_syn n_mis n_non n_spl wmis_cv wnon_cv wspl_cv  pmis_cv ptrunc_cv pallsubs_cv qmis_cv qtrunc_cv
18057      TP53     0    14     2     1   247.5     622     622 0.00e+00  5.04e-08    0.00e+00  0.0000   0.00101
3794      CHST2     0     0     2     0     0.0    1006    1006 7.32e-01  2.40e-06    1.32e-05  0.9353   0.02407
16660     SPTA1     0     9     0     0    20.3       0       0 2.75e-06  7.46e-01    1.35e-05  0.0276   0.99405
19410   ZDHHC17     0     0     0     2     0.0     268     268 7.26e-01  3.78e-05    1.83e-04  0.9353   0.25348
9327   KRTAP2-1     0     0     1     0     0.0   11304   11304 9.51e-01  5.48e-05    2.92e-04  0.9616   0.27548
19877     ZNF66     0     3     0     0    36.8       0       0 2.96e-04  9.18e-01    1.41e-03  0.9353   0.99405
      qallsubs_cv
18057      0.0000
3794       0.0905
16660      0.0905
19410      0.9211
9327       0.9999
19877      0.9999
>

For instance, here CHST2 is significant (q-trunc_cv) but in compared to TP53 with a lot of mutations, this gene carries very few mutations

ADD REPLY • link 5.2 years ago by zizigolu ★ 4.3k

1

Entering edit mode

To avoid incurring into multiple testing problems, I would recommend using "qallsubs_cv", which combines the evidence from missense and truncating substitutions. You can see that TP53 has a very low qallsubs_cv (<0.001), while CHST2 is borderline or non-significant (0.0905). The q-value that you are referring to (qtrunc_cv) is the combined p-value for truncating substitutions (nonsense and essential splice site mutations). As you can see in the n_non and n_spl columns, there are 3 truncating substitutions in TP53 and 2 in CHST2, which is consistent with their q-values for truncating substitutions being 0.001 and 0.02 respectively. Also, note that the size of the gene, its sequence and the mutational signatures are taken into consideration to calculate p-values, and not just the absolute number of mutations.

ADD REPLY • link 5.2 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Sorry to be too much silly in questioning;

Could I run your software for finding non-coding regions of human genome?

For instance likely

https://elifesciences.org/download/aHR0cHM6Ly9jZG4uZWxpZmVzY2llbmNlcy5vcmcvYXJ0aWNsZXMvMjE3NzgvZWxpZmUtMjE3Nzgtc3VwcDQtdjMueGxzeA==/elife-21778-supp4-v3.xlsx?_hash=KQi5jfO3kT2c4Qw44j4Rg6YAyCBQilYuWHVYXcRDuuo%3D

Gives non-coding parts of genome

ADD REPLY • link 5.1 years ago by zizigolu ★ 4.3k

1

Entering edit mode

Do you mean running an analysis to detect significantly mutated non-coding regions? Unfortunately, the dNdScv package only works on coding mutations at the moment, although I may published a new function (called NBR) to do this in the near future. There are some methods out there to do what you say, but they tend to give long lists of false positives (see the paper below): https://www.biorxiv.org/content/10.1101/237313v1

ADD REPLY • link 5.1 years ago by Inigo Martincorena ▴ 220

0

Entering edit mode

Sorry for coming back again

By strelka I have called SNV on my samples

Now I have one .vcf file contains SNV for each cancer-organoid pair of each patient

I spitted this .vcf file to two .vcf files as ORGANOID and TUMOR

I want two .vcf files for my paired samples because I need to check the consistency of mutations in two paired samples For instance for TP53 I want to know if this gene was mutated in both samples or just one of them

by dndsout$annotmuts I have samples and genes mutated in samples, I am wondering is this the same I need to find the consistency between significantly mutated genes for a pair of samples?

Thank you so much for your time

ADD REPLY • link 5.1 years ago by zizigolu ★ 4.3k