Question

CNVkit runtime for WGS?

8.2 years ago

wjar6718 • 0

Hello,

I am a new CNVkit user. I need to run CNVkit on WGS data from a tumor-normal pair (45x and 29x coverage). I am running it with 4 processes (16 GB each). The program has now been running for 2 days, and I think it has been in the fix step for a whole day, because I can already see my reference.cnn. Is this normal?

My command:

./python2.7 cnvkit.py batch Tumor.recal_sort2_dedup2.realigned2.NTrealign.bam \
    --normal Normal_sort_dedup.realigned.recalibrated.NTrealign.bam \
    --rlibpath /cnvkit/R-3.2.3/lib64/R/library:/cnvkit/R-3.2.3/lib64:/cnvkit/R-3.2.3/lib64/R/lib \
    -t data/access-5k-mappable.hg19.bed \
    --fasta hg19_chromosome.fa \
    -g data/access-5k-mappable.hg19.bed --split --annotate data/refFlat.txt -p 4 \
    --output-reference reference.cnn -y

My BAM header:

@HD    VN:1.4    GO:none    SO:coordinate
@SQ    SN:chr1    LN:249250621
@SQ    SN:chr2    LN:243199373
@SQ    SN:chr3    LN:198022430
@SQ    SN:chr4    LN:191154276
@SQ    SN:chr5    LN:180915260
@SQ    SN:chr6    LN:171115067
@SQ    SN:chr7    LN:159138663
@SQ    SN:chr8    LN:146364022
@SQ    SN:chr9    LN:141213431
@SQ    SN:chr10    LN:135534747
@SQ    SN:chr11    LN:135006516
@SQ    SN:chr12    LN:133851895
@SQ    SN:chr13    LN:115169878
@SQ    SN:chr14    LN:107349540
@SQ    SN:chr15    LN:102531392
@SQ    SN:chr16    LN:90354753
@SQ    SN:chr17    LN:81195210
@SQ    SN:chr18    LN:78077248
@SQ    SN:chr19    LN:59128983
@SQ    SN:chr20    LN:63025520
@SQ    SN:chr21    LN:48129895
@SQ    SN:chr22    LN:51304566
@SQ    SN:chrX    LN:155270560
@SQ    SN:chrY    LN:59373566
@RG    ID:Clean3_L7_fix_kmer_q15_TrimN_N0_L70    PU:None    LB:1    SM:T    CN:hcpcg    PL:ILLUMINA
@RG    ID:Clean3_L8_fix_kmer_q15_TrimN_N0_L70    PU:None    LB:1    SM:T    CN:hcpcg    PL:ILLUMINA
@PG    ID:GATK PrintReads    VN:3.4-46-gbc02625    CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@PG    ID:MarkDuplicates    .....
@PG    ID:bwa.6    VN:0.7.12-r1039    CL:./bwa mem -t 4 hg19_chromosome.fa R1.fastq.gz R2.fastq.gz -M -R @RG\tID:Clean3_L8_fix_kmer_q15_TrimN_N0_L70\tPL:ILLUMINA\tPU:None\tLB:1\tSM:T\tCN:hcpcg

Cheers,
James

CNVkit • 4.4k views

Ram · Answer 1 · 2016-01-21

8.2 years ago

Eric T. ★ 2.8k

CNVkit version 0.7.5, released on Saturday, has a much faster implementation of the "fix" command that should be more appropriate for WGS.

If you are using CNVkit v0.7.4 or earlier, and you have generated the .cnn files for each sample and the reference, then it's OK to kill the "batch" job now -- most of the work is done. Then update CNVkit to the latest version, and run the "fix" and "segment" commands manually for your two samples.
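As a rough sketch of that manual step, the Python 3 snippet below builds the "fix" and "segment" command lines for one sample without running them. The helper names `manual_commands` and `as_shell` and the file names are hypothetical. The argument order for "fix" (target coverage, antitarget coverage, reference) follows the command used later in this thread, and the "segment" call assumes the usual `cnvkit.py segment <sample>.cnr -o <sample>.cns` form.

```python
import shlex


def manual_commands(sample_id, reference="reference.cnn"):
    """Build argv lists for running 'fix' and then 'segment' on one sample.

    Assumes the coverage files left by the interrupted batch run are named
    <sample_id>.targetcoverage.cnn and <sample_id>.antitargetcoverage.cnn.
    """
    cnr = sample_id + ".cnr"
    fix = [
        "cnvkit.py", "fix",
        sample_id + ".targetcoverage.cnn",
        sample_id + ".antitargetcoverage.cnn",
        reference,
        "-o", cnr,
    ]
    segment = ["cnvkit.py", "segment", cnr, "-o", sample_id + ".cns"]
    return [fix, segment]


def as_shell(commands):
    """Render each argv list as one shell-quoted command line."""
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in commands]


if __name__ == "__main__":
    # "Tumor" is a placeholder sample ID, not a file from this thread.
    for line in as_shell(manual_commands("Tumor")):
        print(line)
```

The printed lines can be reviewed and pasted into a shell, or each argv list can be passed to `subprocess.run` once the updated CNVkit is installed.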

Thanks, Etal. I will try it out with your easy-to-use package and let you know how it goes.

Cheers,
James

written 8.2 years ago by wjar6718

Hi Etal,

With the WGS data, I have seen the following error when using cnvkit.py fix:

File "/cnvkit/cnvlib/commands.py", line 574, in _cmd_fix
    % (tgt_raw.sample_id, anti_raw.sample_id))
ValueError: Sample IDs do not match:'Clean3_mergedL7L8_150911_FR07887821_targetcoverage' (target) vs. 'Clean3_mergedL7L8_150911_FR07887821_antitargetcoverage' (antitarget)

CNVkit version:

./python2.7 cnvkit.py version
0.7.5

Following the online manual, my antitarget coverage file, Clean3_mergedL7L8_150911_FR07887821_antitargetcoverage.cnn, is empty (header line only):

$ head Clean3_mergedL7L8_150911_FR07887821_antitargetcoverage.cnn
chromosome    start    end    gene    log2

My Clean3_mergedL7L8_150911_FR07887821_targetcoverage:

$ head Clean3_mergedL7L8_150911_FR07887821_targetcoverage.cnn

chromosome    start    end    gene    log2
chr1    10000    10266    DDX11L1    9.23584
chr1    10266    10533    DDX11L1    8.62594
chr1    10533    10799    DDX11L1    4.1049
chr1    10799    11066    DDX11L1    4.61658
chr1    11066    11332    DDX11L1    6.19198
chr1    11332    11599    DDX11L1    6.45691
chr1    11599    11866    DDX11L1    7.39473
chr1    11866    12132    DDX11L1    7.32873
chr1    12132    12399    DDX11L1    7.32655

My command:

./python2.7 cnvkit.py fix Clean3_mergedL7L8_150911_FR07887821_targetcoverage.cnn Clean3_mergedL7L8_150911_FR07887821_antitargetcoverage.cnn Clean3_L8_FR07887830_reference.cnn -o Clean3_mergedL7L8_150911_FR07887821.cnr

My error file:

ValueError: Sample IDs do not match:'Clean3_mergedL7L8_150911_FR07887821_targetcoverage' (target) vs. 'Clean3_mergedL7L8_150911_FR07887821_antitargetcoverage' (antitarget)

I searched Google for a solution, but unfortunately this is beyond my knowledge. Could you help me sort this out? Thank you for your time.

PS: Clean3_mergedL7L8_150911_FR07887821 is Tumor.bam and Clean3_L8_FR07887830 is Normal.bam.

Cheers,
James

written 8.2 years ago by wjar6718

Hi James,

The problem is that the *_antitargetcoverage.cnn and *_targetcoverage.cnn files need to be named *.antitargetcoverage.cnn and *.targetcoverage.cnn instead, i.e. Clean3_mergedL7L8_150911_FR07887821.targetcoverage.cnn and Clean3_mergedL7L8_150911_FR07887821.antitargetcoverage.cnn.

The sample ID is the filename leading up to the first "." character, so there needs to be a "." between "Clean3_mergedL7L8_150911_FR07887821" and "targetcoverage.cnn" or "antitargetcoverage.cnn" for CNVkit to recognize that the samples match. (It's hard-coded this way, sorry).
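The naming rule can be illustrated with a short Python sketch. This is not CNVkit's own code: `sample_id` only mimics the rule as described above (everything before the first "." in the file name), and `dotted_name` is a hypothetical helper that proposes the corrected file name.

```python
import os


def sample_id(path):
    """Return the part of the file name before the first '.' character."""
    return os.path.basename(path).split(".", 1)[0]


def dotted_name(path):
    """Suggest a name with '.' before the (anti)targetcoverage suffix.

    Paths that do not end in one of the two underscore-style suffixes
    are returned unchanged.
    """
    for suffix in ("_antitargetcoverage.cnn", "_targetcoverage.cnn"):
        if path.endswith(suffix):
            return path[: -len(suffix)] + "." + suffix[1:]
    return path


if __name__ == "__main__":
    old_names = [
        "Clean3_mergedL7L8_150911_FR07887821_targetcoverage.cnn",
        "Clean3_mergedL7L8_150911_FR07887821_antitargetcoverage.cnn",
    ]
    for old in old_names:
        new = dotted_name(old)
        # Show how the derived sample ID changes with the new file name.
        print(sample_id(old), "->", sample_id(new), "(" + new + ")")
```

With the underscore names the two derived IDs differ, which matches the ValueError above. With the dotted names both reduce to Clean3_mergedL7L8_150911_FR07887821. The actual rename could then be done with `os.rename(old, dotted_name(old))`.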

written 8.2 years ago by Eric T.

I fixed the file names, ran cnvkit.py fix, and got the following traceback:

Correcting for GC bias...
Correcting for density bias...
Weighting bins by relative coverage depths in reference
Weighting bins by coverage spread in reference
Processing antitarget: Clean3_mergedL7L8_150911_FR07887821
Traceback (most recent call last):
  File "cnvkit.py", line 11, in <module>
    args.func(args)
  File "/scratch/RDS-SMS-PCaGenomes-RW/weejar/bcbiometasv/miniconda/cnvkit/cnvlib/commands.py", line 576, in _cmd_fix
    args.do_gc, args.do_edge, args.do_rmask)
  File "/scratch/RDS-SMS-PCaGenomes-RW/weejar/bcbiometasv/miniconda/cnvkit/cnvlib/commands.py", line 592, in do_fix
    anti_iqr = metrics.interquartile_range(anti_cnarr.residuals())
  File "/scratch/RDS-SMS-PCaGenomes-RW/weejar/bcbiometasv/miniconda/cnvkit/cnvlib/cnary.py", line 262, in residuals
    return np.concatenate(resids)
ValueError: need at least one array to concatenate

I read in your paper that target regions are low in repeats, so I filtered the simple repeats (UCSC hg19) out of access-5k-mappable.hg19.bed and used the filtered access regions as targets, which produces a non-empty set of antitarget regions. Then I started over with 'cnvkit.py batch', and the program finished successfully!
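The filtering idea can be sketched in pure Python for a single chromosome. This is only an illustration of interval subtraction under assumed inputs (half-open BED-style coordinates, with the hypothetical helper `subtract_intervals`), not the exact procedure used here. In practice a dedicated tool such as bedtools would typically do this on the full BED files.

```python
def subtract_intervals(regions, repeats):
    """Remove repeat intervals from regions on a single chromosome.

    Both inputs are lists of half-open (start, end) tuples, as in BED.
    Returns the parts of the regions not covered by any repeat,
    sorted by start position.
    """
    repeats = sorted(repeats)
    kept = []
    for start, end in sorted(regions):
        cursor = start  # left edge of the piece not yet examined
        for r_start, r_end in repeats:
            if r_end <= cursor:
                continue  # repeat ends before the remaining piece
            if r_start >= end:
                break  # repeats are sorted, so no later one can overlap
            if r_start > cursor:
                kept.append((cursor, r_start))  # keep the gap before the repeat
            cursor = max(cursor, r_end)
            if cursor >= end:
                break
        if cursor < end:
            kept.append((cursor, end))  # keep the tail after the last repeat
    return kept


if __name__ == "__main__":
    # Made-up coordinates for illustration only.
    access = [(100, 200), (300, 400)]
    simple_repeats = [(150, 160), (190, 310), (390, 500)]
    print(subtract_intervals(access, simple_repeats))
```

The inner loop rescans all repeats for every region, which is fine for a sketch but slow for genome-scale files.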

Can I use these results? Is this method still consistent with your methods in the paper?

Thank you very much for your time. I hope I can use these results, because they look reasonable.

Cheers,
James

written 8.2 years ago by wjar6718

Sorry, I found the issue and fixed it just now.

For your workaround - good idea! I think these results will still be valid. If you mapped the reads with BWA, that should handle ambiguous read alignments acceptably. In any case CNVkit will downweight or filter out low-quality copy number bins at several steps, and CBS should still do fine if the errors are random.

written 8.2 years ago by Eric T.

Thanks, this is a great program indeed.

PS: The WGS run took 3-4 days using -p 12.

James

written 8.2 years ago by wjar6718