Isolating and extracting regions in VCF file
2
1
Entering edit mode
3 months ago
iarmir ▴ 10

I want to isolate specific regions in the vcf file chr17:63480412 (rs3730025) and chr17:63481680 (rs56394458) into a separate vcf file with only the SNP data from all my hundreds of files.

I only want to annotate these regions. Isolating them will help me annotate them many times faster

How do I do this?

ANNOVAR vcftools bcftools GATK VCF • 799 views
ADD COMMENT
0
Entering edit mode

I see you tagged bcftools. Have you looked at the documentation?

ADD REPLY
2
Entering edit mode
3 months ago

Once you really have hundreds of files, it's time to consider a variant warehouse. Then you will be spending your time querying instead of manipulating VCF files.

One open source warehouse is TileDB-VCF (repo)

import tiledb
import tiledbvcf

vcf_uri = "my_vcf_dataset"
ds = tiledbvcf.Dataset(uri="mytiledbvcfstore", mode="w")
ds.create_dataset(enable_variant_stats=True, enable_allele_count=True) 
ds.ingest_samples(sample_uris = "input.vcf.gz")

df = ds.read(
   regions = ["chr17:63480412-63480412","chr17:63481680-63481680"]
   attrs   = ["sample_name", "contig", "pos_start", "pos_end", "alleles", "fmt_GT"],
)
df

We do have some scripts that can help with VEP annotation on TileDB arrays

More info here (blog) and here (docs)

ADD COMMENT
0
Entering edit mode

Thank you a lot! Never heard about TileDB

Can I write so that all my vcf files are sent to input?

"ds.ingest_samples(sample_uris = "*.DeepVariant.vcf.gz")"?

Will it work correctly?

ADD REPLY
0
Entering edit mode

for wildcards on a local disk you can use the cli

tiledbvcf store --uri my_vcf_dataset *.DeepVariant.vcf.gz

or just use python glob

ds.ingest_samples(sample_uris = glob.glob("*.DeepVariant.vcf.gz"))

ADD REPLY
1
Entering edit mode
3 months ago

A the manual states you can use the -r regions, or the -R, regions-file parameters:

it would look like:

bcftools view -r chr17:63480412,chr17:63481680 input.vcf -o output.vcf

or

bcftools view -R regions.bed input.vcf -o output.vcf

now to operate on all files, use gnu parallel, assuming your filenames are in a file, it would look something like:

cat filenames.txt | parallel bcftools view -R regions.bed {} -o {}.small.vcf
ADD COMMENT
0
Entering edit mode

input.vcf.gz or input.bcf .vcf should be bgzipped+ indexed so it cannot be a plain vcf

or use bcftools view -T

ADD REPLY
0
Entering edit mode

Thanks a lot! Can I use this?

zcat *.vcf.gz | parallel bcftools view -R regions.bed {} -o {}.small.vcf

My vcf files are in .vcf.gz format

After regions.bed should the brackets be empty?

ADD REPLY

Login before adding your answer.

Traffic: 1651 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6