Question: Multi-Sample Vcf To Phylogenetic Tree.

How can I construct a phylogenetic tree based on the SNPs shared between strains? I have whole-genome SNP calls for 10 different strains in a multi-sample VCF.

Are there any tools that can take the VCF as input for creating phylogenetic trees? Or do I need to convert the multi-sample VCF to another matrix? Which kind of matrix would that be, and how can I create it from the VCF?

Is there a list somewhere of popular packages that can be used for creating phylogenetic trees? Or guides on how to go from a multi-sample VCF to a phylogenetic tree?
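
On the "which kind of matrix" part: the tools suggested in this thread generally work from a genotype matrix in which each cell counts the copies of the alternate allele (0, 1, 2, or missing) for one sample at one SNP. The following is a minimal Python sketch of that conversion using only the standard library. It assumes biallelic sites with a GT field; the function name `vcf_to_dosage` and the tiny demo VCF are made up for illustration and are not part of any package mentioned here.

```python
import io

def vcf_to_dosage(handle):
    """Parse a multi-sample VCF into (sample names, site labels, dosage rows).

    Each row is one site; each entry is the number of ALT alleles for one
    sample, or None when the genotype is missing. Multi-allelic sites are
    skipped to keep the sketch simple.
    """
    samples, sites, rows = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # blank line or meta-information header
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        chrom, pos, alt, fmt = fields[0], fields[1], fields[4], fields[8]
        if "," in alt:
            continue  # more than one ALT allele: skip
        gt_index = fmt.split(":").index("GT")
        row = []
        for call in fields[9:]:
            alleles = call.split(":")[gt_index].replace("|", "/").split("/")
            if "." in alleles:
                row.append(None)  # missing genotype
            else:
                row.append(sum(int(a) for a in alleles))
        sites.append(f"{chrom}:{pos}")
        rows.append(row)
    return samples, sites, rows

# Hypothetical three-strain VCF used only to exercise the function.
demo_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tstrainA\tstrainB\tstrainC",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\t1/1:15",
    "1\t250\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1|1:8\t./.:0\t0/0:11",
    "2\t75\t.\tG\tA,T\t50\tPASS\t.\tGT:DP\t0/1:7\t1/2:6\t0/0:9",
]) + "\n"

samples, sites, dosage = vcf_to_dosage(io.StringIO(demo_vcf))
print(samples)
print(sites)
print(dosage)
```

Rows are SNPs and columns are samples here; a distance or similarity matrix between samples is then computed column against column.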

ADD COMMENTlink 6.4 years ago William ♦ 4.4k • updated 9 months ago rm.umayal24 • 0

You can use SNPhylo.

One tip to get it to work: the chromosome IDs in your VCF file have to be numbers (1, 2, 3, 4, ...), not names like Chr1 or Gm01.
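
If the VCF uses names such as Chr1 or Gm01, one way to get numeric IDs is to rewrite the first column before running the tool. The sketch below is a minimal Python example under stated assumptions: chromosomes are numbered in order of first appearance, `##contig` header lines are rewritten to match, and the function name `renumber_chromosomes` is hypothetical. Whether a given tool needs the header lines changed as well is not covered in this thread.

```python
import re

# Matches the ID part of a "##contig=<ID=...>" header line.
CONTIG_RE = re.compile(r"^(##contig=<ID=)([^,>]+)")

def renumber_chromosomes(lines):
    """Return (rewritten lines, mapping) with chromosome names replaced by integers.

    Names are numbered 1, 2, 3, ... in order of first appearance.
    """
    mapping = {}

    def number_for(name):
        # Assign the next free integer the first time a name is seen.
        if name not in mapping:
            mapping[name] = str(len(mapping) + 1)
        return mapping[name]

    out = []
    for line in lines:
        match = CONTIG_RE.match(line)
        if match:
            line = match.group(1) + number_for(match.group(2)) + line[match.end():]
        elif line.strip() and not line.startswith("#"):
            chrom, rest = line.split("\t", 1)
            line = number_for(chrom) + "\t" + rest
        out.append(line)
    return out, mapping

# Hypothetical input used only to exercise the function.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=Gm01,length=1000>",
    "##contig=<ID=Gm02,length=2000>",
    "#CHROM\tPOS\tID\tREF\tALT",
    "Gm01\t100\t.\tA\tG",
    "Gm02\t5\t.\tC\tT",
    "Gm01\t300\t.\tT\tC",
]
out, mapping = renumber_chromosomes(vcf_lines)
print(mapping)
print("\n".join(out))
```

Keep the returned mapping so the numeric IDs can be translated back to the original chromosome names afterwards.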

ADD REPLYlink 4.8 years ago
• 60

Here is what I did with the SNPRelate package to get a dendrogram and a PCA from my multi-sample VCF file:

library(SNPRelate)  # also attaches gdsfmt (openfn.gds, read.gdsn, index.gdsn)

# VCF to GDS
snpgdsVCF2GDS("my.vcf", "my.gds")
genofile <- openfn.gds("my.gds")

# dissimilarity matrix, hierarchical clustering, and cluster assignment
dissMatrix <- snpgdsDiss(genofile, autosome.only=TRUE, remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=10, verbose=TRUE)

snpHCluster <- snpgdsHCluster(dissMatrix, need.mat=TRUE, hang=0.25)

cutTree <- snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm=5000, col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL, label.H=FALSE, label.Z=TRUE, verbose=TRUE)

# PCA coloured by population
# sample.id below is reconstructed from the standard SNPRelate names (lost in the post);
# the GDS node path for the population labels is missing and is left empty, so fill it in
sample.id <- read.gdsn(index.gdsn(genofile, "sample.id"))
pop_code <- read.gdsn(index.gdsn(genofile, ""))

pca <- snpgdsPCA(genofile)

tab <- data.frame(sample.id = pca$sample.id, pop = factor(pop_code)[match(pca$sample.id, sample.id)], EV1 = pca$eigenvect[,1], EV2 = pca$eigenvect[,2], stringsAsFactors = FALSE)

plot(tab$EV2, tab$EV1, col=as.integer(tab$pop), xlab="eigenvector 2", ylab="eigenvector 1")
legend("topleft", legend=levels(tab$pop), pch="o", col=1:nlevels(tab$pop))
ADD COMMENTlink 6.3 years ago William ♦ 4.4k

Hi, it seems like a good tool, but how should I proceed if I want to construct a phylogenetic tree from separate .vcf files for different samples? Do I have to combine them into a multi-sample VCF, or can I handle them independently to create the tree?

Any advice is helpful.

Best, Celso

ADD REPLYlink 6.3 years ago
• 0

You could use either approach.

1) Use a tool to create a multi-sample VCF (e.g. VCFtools)


2) Use snpgdsVCF2GDS() to read in each VCF, then merge in R using snpgdsCombineGeno().

ADD REPLYlink 6.3 years ago

Thanks Neil, I'll try your recommendations.

ADD REPLYlink 6.3 years ago
• 0

When I execute the R script I get the following error: "Removing 181 non-autosomal SNPs Error in snpgdsDiss(genofile) : There is no SNP!" Obviously, the error occurs because no SNPs are left. Which parameter do I need to change to avoid this problem? Thanks!

ADD REPLYlink 6.2 years ago
Diego D.
• 40

As I understand it, this creates a PCA from all of the SNPs in the VCF. How can one filter a VCF so as to keep only unlinked variants?

ADD REPLYlink 22 months ago
• 10

I'd look at the R package SNPRelate. It will read VCF files, create various matrices and plot dendrograms using _e.g._ an identity-by-state matrix. See examples in the vignette PDF.
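
To make the identity-by-state idea concrete: for each pair of samples, IBS is commonly computed as the average fraction of alleles shared across the SNPs where both samples have a call, i.e. (2 - |g_i - g_j|) / 2 per SNP with genotypes coded as 0, 1, or 2. The Python sketch below implements that common definition with the standard library only. The function name `ibs_matrix` and the toy genotypes are made up for illustration, and SNPRelate's own implementation may differ in details such as filtering.

```python
def ibs_matrix(dosage):
    """Pairwise identity-by-state between samples.

    dosage: list of SNPs, each a list of per-sample ALT allele counts
    (0, 1, 2) or None for a missing call.
    Returns an n x n matrix; entries are NaN when two samples share no calls.
    """
    n = len(dosage[0])
    ibs = [[float("nan")] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            shared = 0
            counted = 0
            for snp in dosage:
                a, b = snp[i], snp[j]
                if a is None or b is None:
                    continue  # use only SNPs called in both samples
                shared += 2 - abs(a - b)  # alleles shared at this SNP (0, 1 or 2)
                counted += 1
            if counted:
                ibs[i][j] = ibs[j][i] = shared / (2 * counted)
    return ibs

# Toy genotypes: 4 SNPs (rows) x 3 samples (columns).
dosage = [
    [0, 1, 2],
    [2, None, 0],
    [1, 1, 1],
    [0, 0, 2],
]
ibs = ibs_matrix(dosage)
for row in ibs:
    print(row)
```

A dissimilarity suitable for clustering is then simply 1 - IBS for each pair.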

ADD COMMENTlink 6.4 years ago Neilfws 48k

Thanks, the program worked really well for creating both a phylogenetic tree and a principal component analysis plot almost directly from the VCF file.

ADD REPLYlink 6.4 years ago
♦ 4.4k

I had SNPs for 39 WES samples, some of them from related individuals. I wanted to check the kinship to see if there was any mislabelling during sample processing.

Finally I built a nice tree with the code below.

It was validated with independent observations (family diagrams, ancestry). All the unrelated individuals were connected above the FC (first cousins) line; all sibs, half-sibs, and other relatives were where they should be.

The crucial steps were using the IBS function to calculate distances and taking LD into account.
Without these two I got only misleading trees. The default LD threshold (0.2) removed too many SNPs, so I increased it to 0.5 to achieve higher sensitivity. LD filtering reduced 500K SNPs to 16K.

#install SNPRelate as described here:
#prepare multisample vcf with bcftools merge

library(SNPRelate)
setwd("[your dir here]")

#biallelic by default
snpgdsVCF2GDS("dataset1.vcf", "dataset1.gds")
genofile = snpgdsOpen("dataset1.gds")

#LD based SNP pruning
snpset = snpgdsLDpruning(genofile, ld.threshold = 0.5)
snpset.id = unlist(unname(snpset))  # pruned SNP ids pooled over all chromosomes

# distance matrix - use IBS
# (the snp.id argument was lost in the post; passing the pruned set is assumed)
dissMatrix = snpgdsIBS(genofile, sample.id=NULL, snp.id=snpset.id, autosome.only=TRUE,
    remove.monosnp=TRUE, maf=NaN, missing.rate=NaN, num.thread=2, verbose=TRUE)

snpHCluster = snpgdsHCluster(dissMatrix, need.mat=TRUE, hang=0.01)

cutTree = snpgdsCutTree(snpHCluster, z.threshold=15, outlier.n=5, n.perm = 5000,
    col.outlier="red", col.list=NULL, pch.outlier=4, pch.list=NULL, label.H=FALSE, label.Z=TRUE)

# the end of this call was cut off in the post; y.label.kinship=TRUE is assumed
# because it draws the kinship lines (e.g. FC) referred to above
snpgdsDrawTree(cutTree, main = "Dataset 1", edgePar=list(col=rgb(0.5,0.5,0.5,0.75), t.col="black"),
    y.label.kinship=TRUE)

I hope this will be helpful for somebody.
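
To illustrate why the LD threshold changes the number of SNPs so much, here is a minimal Python sketch of greedy LD pruning. It is not SNPRelate's algorithm: it uses the squared Pearson correlation of genotype dosages as the LD measure and a window counted in SNPs, both of which are simplifying assumptions, and the names `r_squared` and `ld_prune` are hypothetical. A SNP is kept only if its LD with every recently kept SNP is at or below the threshold.

```python
def r_squared(x, y):
    """Squared Pearson correlation of two dosage vectors, ignoring missing calls."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 2:
        return 0.0
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return 0.0  # no variation among shared calls: LD not measurable
    return (sxy * sxy) / (sxx * syy)

def ld_prune(dosage, threshold=0.5, window=50):
    """Return indices of SNPs kept after greedy pruning.

    dosage: SNPs in genomic order on one chromosome, each a list of
    per-sample ALT allele counts. window is measured in number of SNPs.
    """
    kept = []
    for idx, snp in enumerate(dosage):
        recent = [k for k in kept if idx - k <= window]
        if all(r_squared(snp, dosage[k]) <= threshold for k in recent):
            kept.append(idx)
    return kept

# Toy data: 4 SNPs (rows) x 6 samples (columns).
dosage = [
    [0, 1, 2, 0, 1, 2],  # SNP 0
    [0, 1, 2, 0, 1, 2],  # SNP 1: identical to SNP 0
    [2, 1, 0, 2, 1, 0],  # SNP 2: mirror image of SNP 0
    [0, 2, 1, 1, 0, 2],  # SNP 3: weakly correlated with SNP 0
]
kept_loose = ld_prune(dosage, threshold=0.5)
kept_strict = ld_prune(dosage, threshold=0.2)
print(kept_loose, kept_strict)
```

In the toy data the looser threshold keeps the weakly correlated SNP while the stricter one drops it, which is the same trade-off described above between the default 0.2 and 0.5.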


ADD COMMENTlink 3.4 years ago Sergey Naumenko • 350

Thanks for the tip! I want to draw a dendrogram based on RAD-tag SNPs from a non-model organism, using the VCF file output by Stacks. I managed to do this (only) with IBS, using these commands:

snpgdsVCF2GDS("batch_511.vcf", "batch_511.gds")
genofile <- openfn.gds("batch_511.gds")
ibs.hc <- snpgdsHCluster(snpgdsIBS(genofile, num.thread=2))
rv <- snpgdsCutTree(ibs.hc)
plot(rv$dendrogram, leaflab="perpendicular", main="Batch 511")

I suppose the dendrogram is based on distance clustering, but that's not clear from the documentation. Does anyone know? And what are the units of the scalebar in the resulting graph?
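
For readers with the same question, the Python sketch below shows what distance-based agglomerative clustering with average linkage (UPGMA-style) does with a dissimilarity matrix such as 1 - IBS. Whether `snpgdsHCluster` uses exactly this linkage is an assumption to check against its documentation; the function name `average_linkage` and the toy matrix are made up for illustration. In this kind of tree, the height of each join is in the same units as the input dissimilarity.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.

    names: leaf labels; dist: symmetric matrix of pairwise dissimilarities.
    Returns the merges in order as (label_a, label_b, height), where height
    is the mean dissimilarity between members of the two merged clusters.
    """
    clusters = {i: [i] for i in range(len(names))}  # cluster id -> leaf indices
    labels = {i: names[i] for i in range(len(names))}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        height, a, b = best
        merges.append((labels[a], labels[b], height))
        # Replace the two clusters by their union, labelled Newick-style.
        new_label = f"({labels[a]},{labels[b]})"
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        del labels[a], labels[b]
        labels[next_id] = new_label
        next_id += 1
    return merges

# Toy dissimilarities (think of them as 1 - IBS) for four samples.
names = ["A", "B", "C", "D"]
dist = [
    [0.0, 0.1, 0.4, 0.5],
    [0.1, 0.0, 0.4, 0.5],
    [0.4, 0.4, 0.0, 0.2],
    [0.5, 0.5, 0.2, 0.0],
]
merges = average_linkage(names, dist)
for left, right, height in merges:
    print(left, right, round(height, 3))
```

Here A and B join first, then C and D, and the two pairs join last at the mean of their four cross-pair dissimilarities.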

Finally, I haven't yet succeeded in getting any results with the ML alternative in SNPRelate. Should it be possible at all without phased data?

Louis Boumans

ADD COMMENTlink 5.6 years ago Louis Boumans • 20

I just finished installing SNPRelate on R 3.1.1, OS X Yosemite. To save yourself some time, do the following.

First, make sure you have gfortran installed. Follow instructions to install gfortran from here:

Second, download and install both gdsfmt and SNPRelate from source:

ADD COMMENTlink 4.9 years ago danrdanny • 60

Hi all,
My question is related to this one, which is why I'm writing here and not in a new post. Is there a way to make a phylogenetic tree, just as in William's answer, but using a maximum likelihood method? There is a function in the SNPRelate package called snpgdsIBDMLE which performs this task, but I'm not able to get a phylogenetic tree image from it.

Any suggestions?

ADD COMMENTlink 4.2 years ago user230613 • 280

Hi all,

To create a phylogenetic tree from a whole-genome SNP file, including human data, the following software may be helpful.

It is called VCF2PopTree and is available online. It does not need any dependencies; a single HTML file is enough to get the phylogenetic tree.

It is very simple and straightforward. Highly recommended for evolutionary biologists and population geneticists.

ADD COMMENTlink 9 months ago rm.umayal24 • 0
