Question

Ploidy Correction In NGS cancer data (human)

3

4.9 years ago

German.M.Demidov ★ 2.9k

Hi,

I have a callset from cancer NGS data (tumor/normal design, ~100 samples) with CNVs already called. An additional complication is that some samples are close to a triploid or tetraploid state.

I study the impact of genomic variants on therapy response.

In some papers I've come across the term "ploidy-corrected CNVs", but without a clear explanation of how the correction was done. If I understood correctly, people take the average ploidy across the genome as a baseline and call everything more than 0.5 copies below it a deletion and everything more than 0.5 copies above it a duplication. So a diploid gene X in a tetraploid sample would be considered largely deleted.
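The thresholding rule described above can be written down as a minimal sketch. The function name `classify_cnv` and its parameters are hypothetical, chosen only for illustration; the 0.5-copy threshold and the choice of baseline are the assumptions under discussion, not an established standard.

```python
def classify_cnv(copy_number, baseline, threshold=0.5):
    """Label a segment relative to a chosen baseline copy number.

    baseline is the reference level: the sample's average ploidy for
    "ploidy-corrected" calls, or 2 for calls relative to diploid.
    """
    diff = copy_number - baseline
    if diff < -threshold:
        return "deletion"
    if diff > threshold:
        return "duplication"
    return "neutral"


# Gene X with 2 copies in a sample whose average ploidy is 4
print(classify_cnv(2, baseline=4))  # relative to ploidy
print(classify_cnv(2, baseline=2))  # relative to diploid
```

The same gene gets a different label depending only on the baseline, which is exactly the interpretation problem raised here.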

Questions:

1) How do I perform ploidy correction correctly?

2) Should we correct for ploidy in tumors dominated by deletions, e.g. where the average ploidy is 1.5 (roughly half of the genome has lost a copy)? Common sense tells me no, but I wonder if that is correct.

3) Bonus question: do you know how to take the clonality of a variant (expressed as cancer cell fraction, CCF) into account? A duplication in 10% of cells presumably affects therapy response less than a duplication in 90%, but I doubt that the effect of CCF is linear.

Upd: there is a related question, "Ploidy in copy number analysis", but it is not exactly about this. The answer there says "Otherwise, I agree with you, calling relative to ploidy doesn't make a lot of sense biologically." As I see it, though, what matters is relative gene dosage, not absolute, and if everything is tetraploid we may expect some sort of gene product balancing.

Upd1: I apologize for the wrong usage of terminology. I am blocked from responding until tomorrow because I have reached the 5-post limit on this website.

CNV ploidy cancer • 3.7k views

ADD COMMENT • link 4.9 years ago by German.M.Demidov ★ 2.9k

score 3 · Accepted Answer · 2019-06-07

3

4.9 years ago

markus.riester ▴ 550

Since you link an old response of mine:

1). When you say you have CNV calls, I assume you mean segmented log2 tumor/normal coverage ratios, obtained with tools like CNVkit, Control-FREEC, GATK, etc. The measured log2-ratios are a function of copy number, purity and ploidy: copy number obviously increases coverage linearly, purity scales the signal because less of the coverage comes from normal cells that do not have the CNA, and ploidy shifts the log-ratios (https://www.biostars.org/p/381620/#381677). So ploidy adjustment needs to estimate these three factors jointly. Tools like ABSOLUTE, ASCAT, FACETS, Sequenza, Battenberg, and many others do that for you.
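To make the dependence on the three factors concrete, here is a minimal sketch of the commonly used forward model. It assumes that normal cells are diploid and that coverage is normalized so a segment at the sample's average copy number has a ratio of 1. The function name `expected_log2_ratio` is hypothetical and is not taken from any of the tools listed above.

```python
import math


def expected_log2_ratio(copy_number, purity, ploidy, normal_ploidy=2.0):
    """Expected log2 tumor/normal ratio for one segment.

    copy_number: tumor copy number of the segment
    purity:      fraction of tumor cells in the sample (0 < purity <= 1)
    ploidy:      average tumor copy number across the genome
    Note: copy_number 0 at purity 1 gives log2(0) and raises ValueError.
    """
    segment_dna = purity * copy_number + (1 - purity) * normal_ploidy
    average_dna = purity * ploidy + (1 - purity) * normal_ploidy
    return math.log2(segment_dna / average_dna)


# Single-copy loss, pure diploid tumor: log2(1/2)
print(expected_log2_ratio(1, purity=1.0, ploidy=2.0))
# Same loss at 50% purity: the signal shrinks toward 0
print(expected_log2_ratio(1, purity=0.5, ploidy=2.0))
# Unaltered 2-copy segment in a pure tetraploid tumor: shifted to log2(2/4)
print(expected_log2_ratio(2, purity=1.0, ploidy=4.0))
```

The second and third calls illustrate the two effects named above: lower purity compresses the log-ratio, and higher ploidy shifts it.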

2). Every ploidy different from 2 will result in a shift in log2-ratios and, again, tumor purity affects the scaling, so yes.

3). The allelic fraction of a variant is a function of purity, multiplicity (the number of chromosome copies harboring the mutation), total copy number, CCF, a small mapping bias (non-reference alleles often have a smaller chance of passing filters) and sampling noise. The tools mentioned above will give you the purity and allele-specific copy numbers. Getting the multiplicity is essentially a matter of trying all plausible values and picking the one that fits best. Tools like PyClone can provide more accurate CCFs by clustering them across variants, thus reducing sampling noise.

There is a ton of literature that should help you find the best solution for your data.
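As a rough illustration of "trying all plausible multiplicities", the sketch below uses the standard expected allelic fraction and a binomial likelihood. All function names are hypothetical. It assumes diploid normal cells, ignores mapping bias, and picks the multiplicity as if the variant were clonal (CCF = 1), which is a simplification that real tools do not make for subclonal variants.

```python
import math


def expected_allelic_fraction(purity, multiplicity, total_cn, ccf=1.0, normal_cn=2):
    """Expected fraction of reads carrying the variant allele."""
    mutated = purity * multiplicity * ccf
    total = purity * total_cn + (1 - purity) * normal_cn
    return mutated / total


def binomial_loglik(alt_reads, depth, p):
    """Log-likelihood of alt_reads out of depth at allelic fraction p."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # keep the logarithms finite
    log_choose = (math.lgamma(depth + 1) - math.lgamma(alt_reads + 1)
                  - math.lgamma(depth - alt_reads + 1))
    return log_choose + alt_reads * math.log(p) + (depth - alt_reads) * math.log(1 - p)


def best_multiplicity(alt_reads, depth, purity, total_cn):
    """Try every multiplicity from 1 to total_cn, keep the best fit (CCF = 1 assumed)."""
    candidates = range(1, max(total_cn, 1) + 1)
    return max(candidates, key=lambda m: binomial_loglik(
        alt_reads, depth, expected_allelic_fraction(purity, m, total_cn)))


def ccf_estimate(alt_reads, depth, purity, multiplicity, total_cn, normal_cn=2):
    """Point estimate of the cancer cell fraction, capped at 1."""
    vaf = alt_reads / depth
    total = purity * total_cn + (1 - purity) * normal_cn
    return min(1.0, vaf * total / (purity * multiplicity))


# 40 of 100 reads support the variant; purity 0.8, total copy number 4
m = best_multiplicity(40, 100, purity=0.8, total_cn=4)
print(m, ccf_estimate(40, 100, purity=0.8, multiplicity=m, total_cn=4))
```

A single-variant estimate like this is noisy, which is why clustering across variants helps.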

ADD COMMENT • link 4.9 years ago by markus.riester ▴ 550

0

Thanks for the answer! I use my own tool, ClinCNV, which does all these things. I just wonder how to interpret the data when I want to feed the calls, in the form (sample purity, sample ploidy, gene copy number, CCF of this gene copy number), into a machine learning model. Some ways of adjusting for ploidy give me good results with lasso regression and some do not, and it is not clear from a biological point of view what counts as a deletion and what as a duplication in a non-diploid sample. I understand your point 2), but to me a sample with ploidy 1.25 is totally different from a sample with ploidy 2.5, even if it looks like just a linear scaling: the first one does not have a full diploid genome, so it seems we should treat them differently. I don't know whether there are any literature sources for that.

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

1

In a regression, I would probably just use a gene-level log-ratio, ideally purity- and ploidy-adjusted; see for example Zack et al. CCFs of CNAs are tricky, except maybe for single deletions and gains, so be careful here (garbage in, garbage out). If ploidy adjustment has a big impact, then maybe what you are seeing is that genome doublings are correlated with your outcome of interest.

ADD REPLY • link 4.9 years ago by markus.riester ▴ 550

0

It looks like you came here for a discussion as opposed to having any particular question answered (?). You say you have your own tool, so, you obviously already understand the very question that you have asked (?). The answer by Markus is very good.

I personally only use Control-FREEC for everything copy number related these days. I actually still see some people merely comparing log2 ratios, like, just dividing them, and calling that copy number, even in cancer samples. Control-FREEC adjusts for purity, ploidy, and GC content, too.

ADD REPLY • link 4.9 years ago by Kevin Blighe 87k

0

I am quite confident in terms of CNA calling. What I don't understand is the interpretation. I will try to reformulate my question. When we have a tetraploid sample and gene X (and only gene X) has copy number 3, is it a deletion or a duplication? How should I transform it for use as input to a machine learning model whose response Y is tumor growth under therapy?

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

0

A raw CN of 3 in a tetraploid sample would be a 1-copy deletion, or, if the 3 refers to a linear ratio relative to a normal sample (also tetraploid), then it is a raw CN of 12. To me, it makes more sense to use ploidy-corrected log2 ratios as input to any downstream modeling, but I have never worked extensively in this area.

ADD REPLY • link 4.9 years ago by Kevin Blighe 87k

0

Yep, and this is the thing... Samples rarely have a clean integer ploidy, e.g. https://iovs.arvojournals.org/article.aspx?articleid=2518413, fig. 2, and about 20 percent of samples are polyploid, so throwing them out would be strange, and our collaborators will not understand why we threw away 20 percent of their money and effort :(

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

1

I don't understand why you would have to throw away the samples? I would just aim to adjust for the ploidy. You could even do some clonality analysis, broadly speaking. Some people I knew at UCL Cancer Institute have done some great work in that area.

ADD REPLY • link 4.9 years ago by Kevin Blighe 87k

0

Thanks for the answer! Do you know any papers from that UCL lab? And I still don't understand what it means to adjust for ploidy. Is it "to establish the baseline correctly"?

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

1

Well, I'm on one manuscript that is currently under review with them. The particular one to which I was referring, though, is not yet published (and maybe not even submitted), but it relates to defining a lineage of the clonotypes that exist in a tumour.

Yes, as far as I am aware, adjusting for ploidy means to bring everything to baseline.

ADD REPLY • link 4.9 years ago by Kevin Blighe 87k

0

You mean adjusting the measured log2-ratio for purity and ploidy? ABSOLUTE and similar algorithms calculate integer copy numbers for all segments. If you need adjusted log-ratios, you can do some simple algebra (see Zack et al.). The end result is that diploid regions are centered at 0 (log2(2/2)), single losses sit at -1 (log2(1/2)), etc., independent of purity and ploidy.
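The "simple algebra" can be sketched as follows. This is not taken from Zack et al.; it is only the inversion of the usual purity/ploidy model, assuming diploid normal cells. The function names and the `floor` parameter (which keeps homozygous deletions from producing log2(0)) are hypothetical choices for illustration.

```python
import math


def absolute_copy_number(log2_ratio, purity, ploidy, normal_ploidy=2.0):
    """Invert the purity/ploidy model to get the tumor copy number of a segment."""
    ratio = 2.0 ** log2_ratio
    average_dna = purity * ploidy + (1 - purity) * normal_ploidy
    return (ratio * average_dna - (1 - purity) * normal_ploidy) / purity


def adjusted_log2_ratio(log2_ratio, purity, ploidy, floor=0.05):
    """Log-ratio relative to 2 copies, as if the sample were pure and diploid."""
    copy_number = max(absolute_copy_number(log2_ratio, purity, ploidy), floor)
    return math.log2(copy_number / 2.0)


# Sample with purity 0.5 and ploidy 4.
# A 2-copy segment is observed at log2(2/3), about -0.585; adjusted it is ~0.
print(adjusted_log2_ratio(math.log2(2 / 3), purity=0.5, ploidy=4.0))
# A 1-copy segment is observed at -1 here; adjusted it stays at -1.
print(adjusted_log2_ratio(-1.0, purity=0.5, ploidy=4.0))
```

The adjusted values depend only on the segment's copy number, so they are comparable across samples with different purity and ploidy.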

ADD REPLY • link 4.9 years ago by markus.riester ▴ 550

0

They did a great job. I'll try to find the method Zack described in his paper (he cites it as unpublished) for maximum-parsimony reconstruction of the sequence of CNA events. Hopefully it will help.

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

0

I mean, maybe I explained it wrong, but basically my tool identifies the diploid parts and performs calling relative to diploid (just like FACETS does), so in theory there is no need for ploidy adjustment. I just thought that such an adjustment is performed for better interpretability, not for better calling :(

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

0

We commented at the same time. It may be reasonable to assume diploid, but obviously you just have to state that in your methods. Not all tumours exhibit large-scale aberrations, after all. The way Control-FREEC tests ploidy is by having the user select the ploidy states to test. The algorithm then tests each of them over each region and chooses the ploidy that best explains the data at hand.

ADD REPLY • link 4.9 years ago by Kevin Blighe 87k

0

Yeah, I am asking about tumors that exhibit such almost whole-genome alterations :( I just cannot throw them away because they are polyploid and there is no consensus; these samples were obtained with a lot of effort from my colleagues. Control-FREEC, as far as I know, does not work with heterogeneous tumors, but since all my samples went through several rounds of therapy, all of them consist of multiple clones due to selective pressure :(

ADD REPLY • link 4.9 years ago by German.M.Demidov ★ 2.9k

1

Yes, this FACETS heuristic for finding the diploid state works very well in the majority of cases, but it does fail in cases without a clear diploid state (it warns you, though). The alternative is to try all reasonable purity and ploidy combinations (which is what Sequenza and many others do), at the cost of a longer runtime. Most tools support at least some degree of heterogeneity.