How do I interpret alternative allele <CN2> for structural variants in 1000 genomes VCF?
7.3 years ago
Mr. Dave ▴ 50

I'm trying to interpret copy number as described in the 1000 Genomes Phase 3 integrated call set. Here are some relevant lines from the VCF header:

##fileformat=VCFv4.1
##contig=<ID=1,assembly=b37,length=249250621>
##ALT=<ID=CNV,Description="Copy Number Polymorphism">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:LINE1,Description="Insertion of LINE1 element">
##ALT=<ID=INS:ME:SVA,Description="Insertion of SVA element">
##ALT=<ID=INS:MT,Description="Nuclear Mitochondrial Insertion">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CN0,Description="Copy number allele: 0 copies">
##ALT=<ID=CN1,Description="Copy number allele: 1 copy">
##ALT=<ID=CN2,Description="Copy number allele: 2 copies">
##ALT=<ID=CN3,Description="Copy number allele: 3 copies">
##ALT=<ID=CN4,Description="Copy number allele: 4 copies">
{...}
##ALT=<ID=CN124,Description="Copy number allele: 124 copies">
##INFO=<ID=CS,Number=1,Type=String,Description="Source call set.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant">
##INFO=<ID=MC,Number=.,Type=String,Description="Merged calls.">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">

I have filtered the data to only variants with an SVTYPE INFO tag set to DUP, DEL or CNV.

  • Records with INFO/SVTYPE=DEL generally have ALT=<CN0>, but occasionally show ALT=<CN0>,<CN2>. In these cases, there are no calls for <CN2>.
  • Records with INFO/SVTYPE=DUP generally have ALT=<CN2>, but occasionally show ALT=<CN0>,<CN2>. In these cases, there are no calls for <CN0>.
  • Records with INFO/SVTYPE=CNV show a variety of combinations.

Here is a summary of the variants in the above file filtered down to the 3 SVTYPES, with accompanying totals:

#  N SVTYPE ALT
6026 DUP    <CN2>
 100 DUP    <CN0>,<CN2>
   3 DUP    <CN2>,<CN3>

33329 DEL   <CN0>
   7 DEL    <CN0>,<CN2>
   6 DEL    G
   3 DEL    A
   2 DEL    C
   2 DEL    T
   1 DEL    TGGTTCATTGATATTCTGCTGTGGCAC{..Truncated..},T

2716 CNV    <CN0>,<CN2>
 136 CNV    <CN2>,<CN3>
  90 CNV    <CN0>,<CN2>,<CN3>
  50 CNV    <CN2>
  35 CNV    <CN0>,<CN2>,<CN3>,<CN4>
  23 CNV    <CN0>,<CN2>,<CN3>,<CN4>,<CN5>
  23 CNV    <CN2>,<CN3>,<CN4>
  12 CNV    <CN0>
   9 CNV    <CN0>,<CN2>,<CN3>,<CN4>,<CN5>,<CN6>
   8 CNV    <CN2>,<CN3>,<CN4>,<CN5>
   4 CNV    <CN3>,<CN4>
   3 CNV    <CN0>,<CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>
   3 CNV    <CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>
   2 CNV    <CN0>,<CN1>,<CN3>,<CN4>
   2 CNV    <CN1>,<CN3>
   2 CNV    <CN1>,<CN3>,<CN4>,<CN5>
   1 CNV    <CN1>,<CN3>,<CN4>
   1 CNV    <CN1>,<CN3>,<CN4>,<CN5>,<CN6>
   1 CNV    <CN2>,<CN3>,<CN4>,<CN5>,<CN6>
   1 CNV    <CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>,<CN8>
   1 CNV    <CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>,<CN8>,<CN9>
   1 CNV    <CN3>
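For anyone wanting to reproduce a tally like the one above, here is a minimal sketch in Python. It assumes an uncompressed, tab-delimited VCF with ALT in column 5 and INFO in column 8; the function name `tally_svtype_alt` is hypothetical and this is not necessarily how the counts above were generated.

```python
from collections import Counter


def tally_svtype_alt(vcf_lines, svtypes=("DUP", "DEL", "CNV")):
    """Count records per (SVTYPE, ALT) pair, keeping only the given SVTYPEs."""
    counts = Counter()
    for line in vcf_lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        alt, info = fields[4], fields[7]
        # Parse key=value INFO entries; flag-style entries without "=" are ignored.
        tags = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
        svtype = tags.get("SVTYPE")
        if svtype in svtypes:
            counts[(svtype, alt)] += 1
    return counts
```

It can be fed any iterable of lines, for example an open file handle, and the resulting `Counter` can be printed with `most_common()` to get a listing sorted by count.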

So, how do I interpret <CN2> when INFO/SVTYPE is DUP or CNV? Despite the header's description of <CN2>, it seems that it should describe a biallelic duplication when INFO/SVTYPE=DUP; this interpretation also makes sense when reading the article. Does the header description only apply when INFO/SVTYPE=CNV?

INFO/SVTYPE=DEL examples (INFO truncated):

1    738570       esv3584979      G    <CN0>          100    PASS    
    AC=1;AF=0.000199681;AN=5008;CS=DEL_union;END=742020;NS=2504;SVTYPE=DEL;VT=SV
1    766600       esv3584980      G    <CN0>          100    PASS    
    AC=188;AF=0.0375399;AN=5008;CS=DEL_union;END=769112;NS=2504;SVTYPE=DEL;VT=SV
2    50182899     esv3590712;.    A    <CN0>,<CN2>    100    PASS    
    AC=3,0;AF=0.000599042,0;AN=5008;CS=DUP_uwash;END=50192857;NS=2504;SVTYPE=DEL;VT=SV
3    138606780    esv3597927;.    T    <CN0>,<CN2>    100    PASS    
    AC=1,0;AF=0.000199681,0;AN=5008;CS=DUP_gs;END=138620917;NS=2504;SVTYPE=DEL;VT=SV

INFO/SVTYPE=DUP examples:

1       668630      esv3584976      G    <CN2>          100    PASS    
    AC=64;AF=0.0127796;AN=5008;CS=DUP_delly;END=850204;NS=2504;SVTYPE=DUP;VT=SV
1       16013837    esv3585317      T    <CN2>          100    PASS    
    AC=11;AF=0.00219649;AN=5008;CS=DUP_delly;END=16080976;MC=DUP_uwash_chr1_16012226_16082907;SVTYPE=DUP;VT=SV
1       16037975    .;esv3585319    G    <CN0>,<CN2>    100    PASS    
    AC=0,11;AF=0,0.00219649;AN=5008;CS=DUP_gs;END=16071850;SVTYPE=DUP;VT=SV
1       153682976   esv3587592;.    G    <CN2>,<CN3>    100    PASS    
    AC=194,0;AF=0.038738,0;AN=5008;CS=DUP_gs;END=153696281;SVTYPE=DUP;VT=SV

INFO/SVTYPE=CNV examples:

1    1609210      esv3585011;esv3585012              G       <CN0>,<CN2>          100    PASS    
    AC=17,26;AF=0.00339457,0.00519169;AN=5008;CS=DUP_gs;END=1615827;SVTYPE=CNV;VT=SV
1    143984622    esv3587386;esv3587387              A       <CN2>,<CN3>          100    PASS    
    AC=4791,41;AF=0.956669,0.0081869;AN=5008;CS=DUP_gs;END=144094733;NS=2504;SVTYPE=CNV;VT=SV
1    248619876    esv3589555;esv3589556;esv3589557   A       <CN0>,<CN2>,<CN3>    100    PASS    
    AC=19,859,2;AF=0.00379393,0.171526,0.000399361;AN=5008;CS=DUP_gs;END=248634579;SVTYPE=CNV;VT=SV
Y    28462363     CNV_Y_28462363_28740539            T       <CN2>                100    PASS    
    AC=5;AF=0.00408831;AN=1223;END=28740539;SVTYPE=CNV;VT=SV
vcf cnv • 4.9k views
5.9 years ago

It works like this.

For CNVs the REF allele is <CN1>, i.e. one copy on that haplotype.

Humans are diploid, so 0/0 == CN1/CN1, or 2 copies in total.


Say the ALT is <CN0>. Then:

  • 0/0 = CN1/CN1 = 2 copies

  • 0/1 = CN1/CN0 = 1 copy

  • 1/1 = CN0/CN0 = 0 copies


With multiallelic variants, the allele indices used in the genotypes follow the order of the alleles in ALT (1-based, with 0 reserved for REF).

So when ALT is <CN0>,<CN2>, the possible allele indices are 0, 1, 2, corresponding to CN1, CN0, CN2:

  • 0/1 = CN1/CN0 = 1 copy

  • 1/2 = CN0/CN2 = 2 copies

  • 2/2 = CN2/CN2 = 4 copies


So if the ALT is <CN0>,<CN2>,<CN3>,<CN4>:

The allele indices are 0 (REF), 1, 2, 3, 4, in the same order as the alleles in ALT. For example:

1/3 = CN0/CN3 = 3 copies.
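The rule above can be sketched in a few lines of Python. This is only an illustration under the assumption stated in this answer (REF counts as one copy per haplotype, and each <CNn> allele contributes n copies); the function names `allele_copies` and `total_copies` are made up for the example.

```python
import re


def allele_copies(allele_index, alts):
    """Copies carried by one haplotype; index 0 is REF, assumed to be one copy."""
    if allele_index == 0:
        return 1
    alt = alts[allele_index - 1]  # GT indices are 1-based into the ALT list
    match = re.fullmatch(r"<CN(\d+)>", alt)
    if match is None:
        raise ValueError(f"not a copy-number allele: {alt}")
    return int(match.group(1))


def total_copies(gt, alt_field):
    """Sum copies over all alleles in a GT string such as '0|1' or '1/2'.

    Returns None if any allele is missing ('.').
    """
    alts = alt_field.split(",")
    indices = re.split(r"[|/]", gt)
    if "." in indices:
        return None
    return sum(allele_copies(int(i), alts) for i in indices)


print(total_copies("1|2", "<CN0>,<CN2>"))              # CN0 + CN2
print(total_copies("1|3", "<CN0>,<CN2>,<CN3>,<CN4>"))  # CN0 + CN3
print(total_copies("0|1", "<CN2>"))                    # CN1 + CN2
```

Under the same assumption, a DUP record with ALT=<CN2> and genotype 0|1 comes out as 1 + 2 = 3 copies in total. A haploid call (for example on Y) has a single allele index, so the sum is just that one allele's copy number.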
