Biostar Beta. Not for public use.
pulling out a single pseudochromosome from my vcf file
1
Entering edit mode
14 months ago

Hi everyone,

I am trying to extract a single chromosome from my dataset so that I can run analyses using only this sex chromosome and compare to results of similar analyses on the entire genome. However, all of my chromosomes are labeled as "Pseudochromosomes", i.e., the chromosome I want to extract and make it's own vcf file is "Pseudochromosome_Z". I read that the following command should do the trick for regularly named chromosomes:

 grep -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

However, how do I do this for pseudochromosomes? Thanks for bearing with my ignorance.

genome • 211 views
ADD COMMENTlink
0
Entering edit mode

grep -v will work to exclude things specified. So try adding -v to your command above.

ADD REPLYlink
0
Entering edit mode

Thanks for the reply. Unfortunately, that is not working for me. When I add "-v", the new file is the same size as the original vcf. I want a new vcf file that will contain only the Z chromosome. When I try without "v", the new file is only 10kb, tiny and containing no information besides a chromosome list.

ADD REPLYlink
0
Entering edit mode

Is your regular expression correct?

ADD REPLYlink
0
Entering edit mode

I'm not entirely sure. I don't have a vcf file with chromosomes labeled that would fit as "chr". Only the vcf file with Pseudochromosomes, so I can't really test whether the original line works. I got the line from here: How to extract specific chromosome from vcf file

I just tried this: grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Any idea what I'm doing wrong? I'm guessing it probably has no idea what "Z" is since I believe it's looking for just a chromosome number and not a pseudochromosome. Thanks!

ADD REPLYlink
0
Entering edit mode

You have to use the actual names that are in your file in expression. Show the header of your vcf file and a few lines. Someone can correct the expression.

ADD REPLYlink
0
Entering edit mode

I've supplied what I think is the correct information below. Would you mind helping me correcting my expression? I believe it is something close to the following:

 grep -v -w '^#\|^#CHROM\|^chr[Z]' selas.dp1maxmiss50maf05f_subset_.012.vcf > Z.vcf

Thanks!

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##bcftoolsVersion=1.8+htslib-1.8
##bcftoolsCommand=mpileup -Ou -q 20 -a DP -r Pseudochromosome_33 -f /rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta -b bamlist.txt
##reference=file:///rhome/abrelsford/shared/bwaDb/bCalAnn1_v1.p.fasta
##contig=<ID=Pseudochromosome_1,length=197551010>
##contig=<ID=Pseudochromosome_2,length=151342139>
##contig=<ID=Pseudochromosome_3,length=114810999>
##contig=<ID=Pseudochromosome_4,length=18597117>
##contig=<ID=Pseudochromosome_4A,length=44745344>
##contig=<ID=Pseudochromosome_4B,length=23291945>
##contig=<ID=Pseudochromosome_5,length=16645885>
##contig=<ID=Pseudochromosome_5A,length=43880846>
##contig=<ID=Pseudochromosome_6,length=35401958>
##contig=<ID=Pseudochromosome_7,length=39139214>
##contig=<ID=Pseudochromosome_8,length=31090148>
##contig=<ID=Pseudochromosome_9,length=25686456>
##contig=<ID=Pseudochromosome_10,length=22664390>
##contig=<ID=Pseudochromosome_11,length=20302349>
##contig=<ID=Pseudochromosome_12,length=21352500>
##contig=<ID=Pseudochromosome_13,length=17696115>
##contig=<ID=Pseudochromosome_14,length=15497061>
##contig=<ID=Pseudochromosome_15,length=13887164>
##contig=<ID=Pseudochromosome_17,length=10473655>
##contig=<ID=Pseudochromosome_18,length=11617720>
##contig=<ID=Pseudochromosome_19,length=11166003>
##contig=<ID=Pseudochromosome_20,length=15116875>
##contig=<ID=Pseudochromosome_21,length=7710055>
##contig=<ID=Pseudochromosome_22,length=5198135>
##contig=<ID=Pseudochromosome_23,length=6423903>
##contig=<ID=Pseudochromosome_24,length=6461084>
##contig=<ID=Pseudochromosome_25,length=2161911>
##contig=<ID=Pseudochromosome_26,length=6306732>
##contig=<ID=Pseudochromosome_27,length=5749075>
##contig=<ID=Pseudochromosome_28,length=5713987>
##contig=<ID=Pseudochromosome_33,length=1888140>
##contig=<ID=Pseudochromosome_Z,length=74081004>
ADD REPLYlink
3
Entering edit mode
4 months ago
genomax 68k
United States

While following may work,

grep -w 'Pseudochromosome_Z' your.vcf > Z.vcf

you should ideally use bcftools for this since it will preserve the headers etc. You will need to add these to VCF yourself with method above.

bcftools view input.vcf.gz --regions Pseudochromosome_Z

You will need to sort and tabix compress the vcf file. (Ref: How to extract specific chromosome from vcf file )

ADD COMMENTlink
0
Entering edit mode

Thanks! I've made some progress now! I was able to compress and index the vcf file, and this command (bcftools view input.vcf.gz --regions Pseudochromosome_Z) worked, allowing me to view the Z chromosome. How can I save a separate vcf file with only the information from the Z chromosome?

ADD REPLYlink
0
Entering edit mode
bcftools view input.vcf.gz --regions Pseudochromosome_Z > Z.vcf
ADD REPLYlink
0
Entering edit mode

Thank you for your help! This resolved the issue.

ADD REPLYlink
0
Entering edit mode

You can accept the parent answer to provide closure to this thread.

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.3