Biostar Beta. Not for public use.
Tutorial: Analyze exome Copy number variation (CNV) in single patient or in population.
34
Entering edit mode
11 months ago
Chirag Nepal ♦ 2.2k
Copenhagen

I have spend sometime trying to understand/analyze exome CNV (copy number variation). There are many tools to do so, but here i am using varscan2, DNACopy and GISTIC.

Varscan2: http://varscan.sourceforge.net/index.html

DNACopy: http://www.bioconductor.org/packages/2.3/bioc/html/DNAcopy.html

Gistic2: ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

http://gdac.broadinstitute.org/runs/analyses__2013_05_23/reports/cancer/KICH-TP/CopyNumber_Gistic2/nozzle.html

Step-1: Run varscan2 on normal-tumor-mpileup and make copy-number call.

java -jar $varscan copynumber normal-tumor-mpileup $outputFile --mpileup 1 --p-value 0.005 --min-coverage 30 --min-map-qual 20 --min-base-qual 20 --data-ratio Calculate_From_Data

The output file has suffix " _.copynumber_ ", for eg, " _filename.copynumber_ ". It contains all the segments above certain threshold, which are generally in large number, ranging close to 1 million per file. The output should further be processed to find significant amplified/deleted CNA or detect reoccurrent SNVs in large cohort. We will first do it for single T/N pair and on the population. Newer version of samtools mpileup file might throw an error while running copynumber.

Step-2: Use CBS binary segmentation in R to find segmented region. Use mergeSegment.pl (provided by varscan) to merge them.

library(DNAcopy)
P1 = read.table ("filename.copynumber", header=TRUE)
CNA.object <- CNA( genomdat = P1[,7], chrom = P1[,1], maploc = P1 [,2], data.type = 'logratio')
CNA.object.smoothed     <- smooth.CNA(CNA.object)
CNA.object.smoothed.seg <- segment(CNA.object.smoothed, verbose=0, min.width=2)
seg.pvalue <- segments.p(CNA.object.smoothed.seg, ngrid=100, tol=1e-6, alpha=0.05, search.range=100, nperm=1000)
write.table (seg.pvalue,       file="cbs_P1_pvalue", row.names=F, col.names=F, quote=F, sep="\t")

# This will not print the p-value.
#segmented = CNA.object.smoothed.seg$output
#write.table (segmented[,2:6], file="cbs_P1",        row.names=F, col.names=F, quote=F, sep="\t")
  • Step-2.1: Input the output file to mergeSegments.pl [downloaded from varscan2]. Note: mergeSegment.pl needs an additional column-2, where "chrom" needs to inserted.
cat cbs_P1_pvalue | awk 'BEGIN {OFS="\t"} { print $1,"chrom",$2,$3,$4,$5,$6,$7,$8,$9,$10 }' > formattedInputFile

perl mergeSegment.pl formattedInputFile --ref-arm-sizes $refArmSize --output-basename filteredOutput

Now, from the output file, significant CN amplified or deleted regions can be detected.

cat filteredOutput | grep -e amplification -e deletion

cat filteredOutput | grep large-scale

cat filteredOutput | grep neutral

Have a closer look at the output file. The output files also reports few regions which are neutral and few large scale deletions.

Step-3: Now, let's try to find recurrent CNV among population, for which i use varscan and GISTIC. The output from varscan copynumber is feed to GISTIC. To run GISTIC, you need to make segment file and marker position. Also refer to the GISTIC documentation ftp://ftp.broadinstitute.org/pub/GISTIC2.0/GISTICDocumentation_standalone.htm

http://genepattern.broadinstitute.org/ftp/distribution/genepattern/modules_public_server_doc/GISTIC2.pdf

  • Step-3.1: Make marker position
cat filename.copynumber | grep -v -e chrom -e chrM -e chrX -e chrY | awk 'BEGIN {OFS="\t"} { print "ID'",$1,$2 }' | sed 's/chr//g' > markerPosition
  • Step-3.2: Make segment file. The output from mergeSegment.pl, which in this example with start with suffix " _filteredOutput_ ' should be used to make segment file.
name=filteredOutput


cat $name | awk 'BEGIN {OFS="\t"} { print "ID",$2,$3,$4,$5,$6 }' | grep -v -e chrM -e chrX -e chrY | sed 's/chr//g' > segmentedFile

Step-4: Run GISTIC. Please read the documentation on how to install/execute GISTIC.

markersfile=markerPosition

segfile=segmentedFile

gp_gistic2_from_seg -b $basedir -seg $segfile -mk $markersfile -refgene $refgenefile -genegistic 1 -smallmem 1 -broad 1 -brlen 0.5 -conf 0.90 -armpeel 1 -savegene 1 -gcm extreme

Step-5: Check the output plots. The amplified and deletion plots and the output text files are ready.

From output file of GISTIC, read information from this site.

http://www.genomespace.org/support/guides/recipes/sections/identify-biological-functions-for-genes-in-cnv-regions

Please feel free to contribute. I will update when i find better solution.

ADD COMMENTlink
3
Entering edit mode
Do not use on-target data, use off-target data! See Revolution (?) in CNA detection using exome/targeted sequencing
ADD REPLYlink
1
Entering edit mode

I edited the best I could to make this more readable. Though, you might want to add some more links/background on the approach and explain some of the variables being used (just a suggestion).

ADD REPLYlink
0
Entering edit mode

I think $name at step 3.2 need to be replaced with actual file name.

ADD REPLYlink
0
Entering edit mode

I use Varscan on exome data, so it seems to me that I should the exome target list as Marker list - maybe by taking the exon mid point (since markerPosition only allows one coordinate). Thoughts?

Incidentally, I've found that using the exome target list as whitelist in samtools mpileup greatly reduces noise in the Varscan output.

ADD REPLYlink
0
Entering edit mode

Can i use this method on WGS data?

ADD REPLYlink
0
Entering edit mode

It would be very inefficient. I'd recommend CNVnator instead for WGS.

ADD REPLYlink
0
Entering edit mode

Thank you. After that run GISTIC for CNAs?

ADD REPLYlink
0
Entering edit mode

GISTIC is normally used to find recurrent CNV among patients, and there are other tools as well to find recurrent CNVs. First, you need to find enriched CNV segments for each patient, using anytool (CNVnator or any other tools), then you can use GISTIC2 to find recurrent CNV.

ADD REPLYlink
0
Entering edit mode

Thank you

But since GISTIC need segmentation file and marker file which require number of markers to be filled in the segmentation file and the marker position in the marker file. Seg.CN i can get from the CNV segments from CNVnator. So, was thinking if CNVnator output provides these information.

ADD REPLYlink
0
Entering edit mode

There are many tools to find recurrent CNVs but GISTIC is the most popular i guess

ADD REPLYlink
0
Entering edit mode

Hi rse, did you figure out a way to run GISTIC2 after calling CNVs with CNVnator?

ADD REPLYlink
0
Entering edit mode

No, i could not run GISTIC2

ADD REPLYlink
0
Entering edit mode

Thanks for your process.I have a question,could please interpret why should exclude chrM chrY chrX in the gistic analysis?

ADD REPLYlink
0
Entering edit mode

Sorry,

Could I use copy number in .csv format from caveman for gistic input?

ADD REPLYlink
0
Entering edit mode

Thank you for providing this tutorial as I found tailoring this together is very hard for me

Could I ask from where I can provide refArmSize please?

ADD REPLYlink
2
Entering edit mode
4.5 years ago
wzlnot • 20
China

Replenish some detail and suggests:

1)In DNAcopy output part, the col.names and row.names suggested to set Ture . The mergeSegments.pl can hangdle this.

2) the chromosome name between --ref-arm-sizes file and DNAcopy output should be consistent. if not, the mergeSegments.pl will malfunction.

3) filter the un-located scaffolds or mtDNA if them exist. mergeSegments.pl assign the hash key with the chromosome name in --ref-arm-size file, and call hash value by use the chromsome name in DNAcopy output. If chromsome of DNAcopy is not in --ref-arm-size file, then cause the "Use of uninitialized value ..." error.

so, suggest:

cat CNA.smoothed.segs.pvalue | awk -F "\t" '$3~/chr[1-9,XY]/{print}' > mergeSegments.input
ADD COMMENTlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1