Question

R plot for variant calling

0

Entering edit mode

4.9 years ago

evelyn ▴ 230

hi,

I have tried using upset plot for three vcf files from different pipelines. I extracted the variant column (SNPs) and used these csv files (with one column) for R import. I have used this code:

set1 <- read.csv("set1.vcf", sep="")
set2 <- read.csv("set2.vcf", sep="")
set3 <- read.csv("set3.vcf", sep="")

set1 <- as.vector(set1$V1)
set2 <- as.vector(set2$v1)
set3 <- as.vector(set3$V1)

read_sets = list(set1_reads = set1,
set2_reads = set2,
set3_reads = set3)

upset(fromList(read_sets),
       sets = c("set1_reads", "set2_reads", "set3_reads"),
       number.angles = 20, point.size = 2.5, line.size = 1.5,
       mainbar.y.label = "read intersection", sets.x.label = "read set size",
       text.scale = c(1.5, 1.5, 1.25, 1.25, 1.5, 1.5), mb.ratio = c(0.65, 0.35),
       group.by = "freq", keep.order = TRUE)

It gives an intersection plot but when the number of SNPs from upset plot are really low when I compared these with vcf-compare results using same vcf files. I am not sure why I am getting different numbers with upset plot.

SNP R • 2.7k views

ADD COMMENT • link 4.9 years ago by evelyn ▴ 230

1

Entering edit mode

If you don't mind shifting to python, I have some code which does what you are looking for, more or less, in this repository: https://github.com/wdecoster/surpyvor/

Let me know if you can't figure it out - sooner or later I'll implement it for SNVs as well.

ADD REPLY • link 4.9 years ago by WouterDeCoster 47k

0

Entering edit mode

I have never used python.I will give it a try and will let you know how it works.

ADD REPLY • link 4.9 years ago by evelyn ▴ 230

0

Entering edit mode

Can you post example lines from your VCF files? Especially lines that are found to overlap via upsetR and those that are not so we can see the difference. My instinct would be slightly different formatting if you are seeing some overlapping, but looking at some example lines would help.

ADD REPLY • link 4.9 years ago by shawn.w.foley ★ 1.3k

0

Entering edit mode

I am not sure how to extract the overlapping SNPs/vcf lines.

ADD REPLY • link 4.9 years ago by evelyn ▴ 230

0

Entering edit mode

You can use grep to do that. By the way, why are you reading VCF files using read.csv? Even if the defaults set for read.csv (as opposed to read.table which is slightly better here) are being overridden because you set the sep, you set the sep to "", which means the result data frame has one column per character.

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

If I use read.table to import, I get this error:

Error in start_col:end_col : argument of length 0

ADD REPLY • link 4.9 years ago by evelyn ▴ 230

0

Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

Are your vcf file standard vcf format files ? Would be nice if you can share example "set1.vcf".

ADD REPLY • link 4.9 years ago by zx8754 11k

0

Entering edit mode

I extracted Chrom REF ALT columns by removing the headers for R import like this (some lines from set1.vcf):

#CHROM  REF ALT
ch00    A   AATCATCATCATCATCATC
ch00    A   G
ch00    A   C
ch00    T   A
ch00    G   C
ch00    G   A
ch00    C   T
ch00    A   T
ch00    TTC T
ch00    C   A
ch00    C   T
ch00    T   A

ADD REPLY • link 4.9 years ago by evelyn ▴ 230

0

Entering edit mode

It is hard to guess without actual data, could you post outputs of:

vcf-compare -H set1.vcf set2.vcf set3.vcf

versus in R:

length(intersect(set1, set2))
length(intersect(set1, set3))
length(intersect(set2, set3))

length(setdiff(set1, set2))
length(setdiff(set1, set3))
length(setdifft(set2, set3))

etc...

ADD REPLY • link 4.9 years ago by zx8754 11k

score 3 · Accepted Answer · 2019-05-14

3

Entering edit mode

4.9 years ago

AK ★ 2.2k

Try this one (note that set1.vcf, set2.vcf, and set3.vcf are the original/real vcf without grep, and it's read.vcfR):

rm(list = ls())
library(UpSetR)
library(vcfR)

set1 <- read.vcfR("set1.vcf")
set2 <- read.vcfR("set2.vcf")
set3 <- read.vcfR("set3.vcf")

set1 <-
  as.vector(paste(set1@fix[, "CHROM"], set1@fix[, "POS"], sep = "_"))
set2 <-
  as.vector(paste(set2@fix[, "CHROM"], set2@fix[, "POS"], sep = "_"))
set3 <-
  as.vector(paste(set3@fix[, "CHROM"], set3@fix[, "POS"], sep = "_"))

read_sets <- list(set1_reads = set1,
                  set2_reads = set2,
                  set3_reads = set3)

upset(fromList(read_sets), order.by = "freq")

If you would like to compare genotypes (like -g, --cmp-genotypes Compare genotypes, not only positions in vcf-compare), adding "ALT" in addition to "CHROM" and "POS" should work.

ADD COMMENT • link 4.9 years ago by AK ★ 2.2k

0

Entering edit mode

It worked. I got same results as of vcf-compare. Thank you!

ADD REPLY • link 4.9 years ago by evelyn ▴ 230

0

Entering edit mode

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one answer if they all work.

Upvote|Bookmark|Accept

ADD REPLY • link 4.9 years ago by Ram 43k

0

Entering edit mode

I tried to use the same codes for six vcf files but it shows the plot for five files only.

ADD REPLY • link 4.9 years ago by evelyn ▴ 230

0

Entering edit mode

This is already a new question, please post a new question with a link to this one. In this post the question was based on 3 sets of files, and this solution worked for you, consider accepting this as answer - tick.

ADD REPLY • link 4.9 years ago by zx8754 11k