Biostar Beta. Not for public use.
Question: Compare genotypes between WGS and targeted panel

Hi all,

I am trying to compare the genotypes between two human cohorts: one sequenced by whole genome sequencing (50X) and another sequenced using a custom panel (250X). I performed a t-distributed stochastic neighbor embedding (t-SNE) analysis, and the two populations separate cleanly into two distinct clusters.

I suspect the difference in clustering might be due to the use of different technologies (WGS and targeted sequencing).

The DNA was sequenced on an Illumina platform, the SNVs were called using GATK HaplotypeCaller, and the calls were recalibrated for both populations. However, the mean number of variants per sample is higher in the targeted sequencing cohort.

I created a 0/1 matrix for absence/presence of a variant at each genomic position reported in the VCF files from the WGS and targeted cohorts, as shown in the example below:

                Sample1 Sample2 Sample3
chr3:37428076   0   1   0

I created a final SNV list by adding each cohort's specific coordinates to the other cohort, so that both cohorts have the same set of coordinates.
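A minimal sketch of this matrix construction, for illustration only and not the exact script used. The sample names, the coordinates other than chr3:37428076, and the `build_presence_matrix` helper are all hypothetical, and it assumes sample names are unique across cohorts:

```python
# Hypothetical per-sample variant positions, as they might be read from each cohort's VCF.
wgs_calls = {
    "Sample1": {"chr1:1000", "chr3:37428076"},
    "Sample2": {"chr3:37428076"},
}
panel_calls = {
    "Sample3": {"chr3:37428076", "chr7:5000"},
}

def build_presence_matrix(*cohorts):
    """Return (samples, positions, rows) with one 0/1 row per position."""
    calls = {}
    for cohort in cohorts:
        calls.update(cohort)  # assumes sample names are unique across cohorts
    samples = sorted(calls)
    # Union of all coordinates, so every sample gets the same set of rows.
    positions = sorted(set().union(*calls.values()))
    rows = {
        pos: [1 if pos in calls[s] else 0 for s in samples]
        for pos in positions
    }
    return samples, positions, rows

samples, positions, rows = build_presence_matrix(wgs_calls, panel_calls)
print("position", *samples, sep="\t")
for pos in positions:
    print(pos, *rows[pos], sep="\t")
```

In a matrix built this way, a 0 may mean either that no variant was called or that the position was never covered in that cohort (for example, a WGS-only position outside the panel's target regions); the two cases are not distinguished.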

Does anyone know how to perform this kind of comparison?

Thank you.

2.1 years ago mpinsach • 0 • updated 2.1 years ago Kevin Blighe 43k

You should:

  1. Filter the datasets so that only common variants are included
  2. Normalise the VCFs / BCFs (bcftools norm -m-any)
  3. Merge everything together
  4. Read the data into PLINK and check samples against 1000 Genomes (see Produce PCA bi-plot for 1000 Genomes Phase III in VCF format)
  5. Run the comparisons in PLINK (e.g. logistic regression)
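To illustrate what step 2 is for: the same site can be written as one multiallelic record in one file and as separate biallelic records in another, and then the two will not match on merging. The Python sketch below shows only the idea of splitting a multiallelic record. It is a simplification, not how bcftools implements it: the `split_multiallelic` helper and the example records are hypothetical, and genotype fields, which a real tool must also rewrite, are ignored.

```python
def split_multiallelic(chrom, pos, ref, alt):
    """Split a comma-separated ALT field into one biallelic record per allele."""
    return [(chrom, pos, ref, allele) for allele in alt.split(",")]

# Hypothetical records: (CHROM, POS, REF, ALT)
records = [
    ("chr3", 37428076, "G", "A,T"),  # multiallelic site
    ("chr3", 37430000, "C", "T"),    # already biallelic
]

# Flatten the per-record lists into a single list of biallelic records.
biallelic = [rec for r in records for rec in split_multiallelic(*r)]
for rec in biallelic:
    print(*rec, sep="\t")
```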

I do not know anything about sample numbers, disease state, or ethnicity, so I cannot recommend specific tests.

2.1 years ago Kevin Blighe 43k

Dear Kevin,

Thank you so much for your reply. I waited to write back until I had tried your suggestions myself.

I followed all your suggestions as well as your post Produce PCA bi-plot for 1000 Genomes Phase III in VCF format [1], but I got stuck after pruning variants from each chromosome of 1000 Genomes. I also don't know how to merge my cohorts file with the 1000 Genomes data so that they can be compared in PLINK.

Regarding the sample specifics, the WGS cohort is composed of 200 healthy individuals, while the targeted sequencing cohort is composed of 91 individuals with cardiac disease. Both cohorts are Caucasian, although one comes from America and the other from Spain.

Thank you.

2.1 years ago mpinsach • 0

Would Spanish be considered Caucasian or Hispanic? The idea of merging with 1000 Genomes is specifically to gauge the influence of ethnicity in your cohort. Without correcting for ethnicity, you may find false associations.

You should, in that case, merge your 2 datasets together, and then merge with 1000 Genomes.
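The merge order can be sketched as follows. This is a toy Python illustration, not a replacement for dedicated VCF or PLINK tooling. It assumes each dataset has been reduced to a table of genotypes keyed by (chrom, pos, ref, alt) and that only sites present in both inputs are kept at each stage. The `merge_on_shared_sites` helper, the sample names, and all genotypes are hypothetical.

```python
def merge_on_shared_sites(a, b):
    """Merge two {site: {sample: genotype}} tables, keeping only sites found in both."""
    shared = a.keys() & b.keys()
    return {site: {**a[site], **b[site]} for site in sorted(shared)}

# Hypothetical genotype tables keyed by (chrom, pos, ref, alt).
wgs = {
    ("chr3", 100, "G", "A"): {"W1": "0/1"},
    ("chr3", 200, "C", "T"): {"W1": "0/0"},
}
panel = {
    ("chr3", 100, "G", "A"): {"P1": "1/1"},
    ("chr3", 300, "A", "G"): {"P1": "0/1"},
}
reference = {  # stands in for 1000 Genomes
    ("chr3", 100, "G", "A"): {"REF1": "0/0"},
}

cohorts = merge_on_shared_sites(wgs, panel)           # first: the two cohorts
combined = merge_on_shared_sites(cohorts, reference)  # then: add the reference panel
for site, genotypes in combined.items():
    print(site, genotypes)
```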

Are you receiving any error message?

2.1 years ago Kevin Blighe 43k

The clinical information I received for the Spanish individuals lists their ethnicity as Caucasian.

For my two cohorts I did the following:

  1. Filter the datasets so that only common variants are included. I did it with GATK, but I first had to remove multiallelic sites.
  2. Merge everything together with vcf-merge

Then I followed the instructions from your post "Produce PCA bi-plot for 1000 Genomes Phase III in VCF format", but I don't know at which step I should combine the 1000 Genomes data with my merged cohorts, or how I should do it.

2.1 years ago mpinsach • 0
