Plot number of mutations in each cancer type in VCF files?
5.8 years ago
DanielC ▴ 160

Dear Friends,

I am trying to plot the "number of variants in each cancer type" from VCF files. Could you please let me know how to do this using R, Python, or bash? I have a text file listing the samples and the cancer type associated with each, like this:

Samples                         Cancer Type
TCGA-XXX.barcode                ACC
.
.

I am new to this and learning. Thanks much!

DK

vcf SNP plot • 3.7k views

It is unclear which data you have - be specific, e.g. number of samples - and which type of plot you aim to obtain. Please elaborate and show an example.


Thanks! These are TCGA cancer data VCF files: merged VCF files of about 10000 samples for each chromosome. I do not know the name of the plot, but what I am looking for is a plot of the "number of variants in each cancer type in the VCF files". Please let me know if that is clear and what you think could be done to obtain this. Thanks much!


Can you perhaps draw the plot you have in mind on a piece of paper, take a picture and show us? How should the 10k samples be summarized?


Thanks for your reply! This is the type of plot I am looking to generate from VCF files: https://www.dropbox.com/s/rfvyw8b8v62lhuz/example.jpg?dl=0

Please let me know how to generate this plot from vcf files. Thanks


See also How to add images to a Biostars post

How do you link the sample identifiers in the vcf to the cancer types?

Please update your initial question when adding information. We are losing valuable time here because I have to ask for clarification every time. I assume people who can help you don't want to go through all these comments.


Thanks! I will remember that next time.

To link the identifiers to the cancer type I have a text file like this:

Sample                           Cancer Type
TCGA-XXX-barcode      ACC

Now I am trying to figure out how to use this file, together with the merged VCF file for 10000 samples, to plot the "number of variants for each cancer type". Please let me know if I am missing any information here and I will provide it. Thanks.


Also that file is important information which should have been part of your first post. We have now wasted 11 hours until we found all required information.

Please update your initial question when adding information.

I can solve this in Python, but not in R.


Thanks, I have updated the question. Could you please share how such a plot can be made with the information available? I would really appreciate it. Thanks.

5.8 years ago

I wrote a script which should be able to handle this.

The script below takes two arguments:

  • --samples: your list of samples with their cancer type. No header, just a space-separated file with two columns.
  • --vcf: your vcf file, for which all sample names are found in the file specified by --samples

This script requires cyvcf2 and matplotlib which you can install from pip:

pip install -U cyvcf2
pip install -U matplotlib

Save the code, e.g. as samples_to_hist.py, and execute it as follows (fill in the proper files):

python samples_to_hist.py --samples samples.txt --vcf variants.vcf

Please let me know how it goes.
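
As a rough illustration of the approach (a sketch under stated assumptions, not the script referred to above): read the two-column samples file, walk the VCF (plain or .gz), and count non-reference genotype calls per cancer type. The parsing here uses only the standard library rather than cyvcf2, and the function names (read_sample_types, count_variants_by_type, plot_counts) are made up for illustration. It assumes that "number of variants" means non-reference calls summed over the samples of each cancer type, and that GT is the first FORMAT field.

```python
import gzip
from collections import Counter


def read_sample_types(path):
    """Map sample name -> cancer type from a two-column, whitespace-separated file."""
    mapping = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                mapping[fields[0]] = fields[1]
    return mapping


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def count_variants_by_type(vcf_lines, sample_types):
    """Count non-reference genotype calls per cancer type.

    Assumption: samples missing from sample_types are silently skipped,
    which differs from a script that fails on unknown samples.
    """
    counts = Counter()
    samples = []
    for line in vcf_lines:
        if not line.strip() or line.startswith("##"):
            continue
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            continue
        for sample, call in zip(samples, fields[9:]):
            cancer = sample_types.get(sample)
            if cancer is None:
                continue
            alleles = call.split(":")[0].replace("|", "/").split("/")
            # A call counts as a variant if any allele is neither REF (0) nor missing (.)
            if any(a not in ("0", ".") for a in alleles):
                counts[cancer] += 1
    return counts


def plot_counts(counts, out_png="variants_per_cancer_type.png"):
    """Draw a bar chart of the counts (needs matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to file, no display needed
    import matplotlib.pyplot as plt

    labels = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(labels, [counts[k] for k in labels])
    ax.set_xlabel("Cancer type")
    ax.set_ylabel("Number of variant calls")
    fig.savefig(out_png)
```

A possible way to chain these: `with open_vcf("variants.vcf.gz") as fh: plot_counts(count_variants_by_type(fh, read_sample_types("samples.txt")))`. For 10000-sample files a pure-Python loop like this will be slow; it is meant to show the logic, not to be fast.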


Thanks much! Could you please clarify two things:

a) My samples.txt lists 8000 samples and the VCF file has 10389 samples. Will the program run in this scenario?

b) Can the program be made to run on vcf.gz files?

Thanks much!


a) No, it will fail explicitly because it encounters samples in the VCF that are not in samples.txt (function test_vcf, line 29).

b) It already supports vcf.gz files ;-)


Thanks! Is it possible to modify the program for scenario a), where the number of samples in samples.txt is smaller than the number of samples in the VCF file? :-)


Yeah that's possible but I don't know when I'll have time to adapt that. Or you could adapt your input file.
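
One way such an adaptation could look, as a sketch only: instead of failing, split the VCF sample columns into those present in samples.txt and those that are not, then count only the former. The helper name split_vcf_samples is hypothetical and is not part of the script discussed above.

```python
def split_vcf_samples(header_line, sample_types):
    """Split the samples on a #CHROM header line into kept and skipped.

    Returns (kept, skipped): kept is a list of (column_index, sample, cancer_type)
    for samples found in sample_types; skipped is a list of sample names
    that have no cancer type assigned.
    """
    fields = header_line.rstrip("\n").split("\t")
    kept, skipped = [], []
    # The first nine columns are fixed VCF fields (CHROM ... FORMAT)
    for idx, sample in enumerate(fields[9:], start=9):
        if sample in sample_types:
            kept.append((idx, sample, sample_types[sample]))
        else:
            skipped.append(sample)
    return kept, skipped
```

The skipped list is worth printing once, so that a typo in samples.txt does not silently drop samples from the plot.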


Thanks for the reply! Since the VCF file has 10389 samples, I would have to find the ones with no match in samples.txt and then delete those columns from the VCF file, which is huge, so it would take a lot of time. If possible, I would really appreciate it if the script could be modified. :-) Thanks.


bcftools can do what you are looking for


Great! I am learning useful stuff here. Could you please point me to a source on how to do this using bcftools? Thanks much!


Have you gone through the manual?


Yes, please let me know if this is right:

bcftools -S samples-to-remove.txt XX.vcf.gz > filtered-vcf.vcf

samples-to-remove.txt:
^TCGA...1
^TCGA...2

. .


Whether that command works right should be really easy to confirm on your own. I think I did enough here - time for you to show some effort too.
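
For readers landing here later, a hedged sketch of how sample exclusion is usually expressed with bcftools: the subcommand is view, the sample file holds plain sample names (one per line), and the ^ prefix goes on the file name passed to -S rather than on each line. The Python wrapper below only builds the command; the function name and file names are placeholders, and the options should be checked against the bcftools manual for your version.

```python
import subprocess


def build_exclude_command(vcf_path, remove_list_path, out_path):
    """Build a bcftools view command that drops the samples listed in a file.

    remove_list_path: text file with one plain sample name per line.
    The leading '^' on the file name tells bcftools to exclude those samples.
    """
    return [
        "bcftools", "view",
        "-S", "^" + remove_list_path,  # exclude samples listed in the file
        "-O", "z",                     # write bgzip-compressed VCF
        "-o", out_path,
        vcf_path,
    ]


def run_exclude(vcf_path, remove_list_path, out_path):
    """Run the command; requires bcftools on PATH."""
    subprocess.run(build_exclude_command(vcf_path, remove_list_path, out_path), check=True)
```

Calling `build_exclude_command("XX.vcf.gz", "samples-to-remove.txt", "filtered.vcf.gz")` yields the equivalent of `bcftools view -S ^samples-to-remove.txt -O z -o filtered.vcf.gz XX.vcf.gz`.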

16 months ago
cocchi.e89 ▴ 260

Take a look at this: plot-VCF

It lets you plot any flag or sample, separate cases and controls, run gene analyses, etc. (it is well documented on the GitHub page).

