Biostar Beta. Not for public use.
Slicing a vcf file to just a few thousand SNPs with PyVCF.
0
Entering edit mode
3.6 years ago
Pedro Morell • 10
@Pedro Morell22807

Hi, I'm trying to reduce a vcf file from 1000 Genomes to just a set of SNPs (around 40k). I have the list of dbSNPs IDs stored on a pandas Series, and I'm trying to retrieve just those genes with a "if record in series:" , but it's not working. Any suggestion on how to call for just certaing SNPs?

Thank you.

Python VCF 1000Genomes • 902 views
2
Entering edit mode
3.6 years ago
@Pierre Lindenbaum30
--keepIDs / -IDs

List of variant IDs to select
If a file containing a list of IDs is provided to this argument, the tool will only select variants whose ID field is present in this list of IDs. The matching is done by exact string matching. The expected file format is simply plain text with one ID per line

0
Entering edit mode

Thanks! I'm trying to use this method, but I'm having an Error with the input. My file looks like this:

rs369720564
rs369711657
rs369708999
rs369700777
rs369687712


So, according to the description (plain text, a single ID per row) it should be working. Any advices?

0
Entering edit mode

I'm having an Error with the input

what is the error ?

0
Entering edit mode

ERROR MESSAGE: Invalid argument value 'HCL.txt' at position 4.

Also tryed with a toy list copied from the IDs in the vcf and got the same message, so the format is the issue.

0
Entering edit mode

can you send the full cmd line please.

0
Entering edit mode
java -jar GenomeAnalysisTK.jar -T SelectVariants --variant input.vcf --keepIDs snps.txt --out output.vcf

0
Entering edit mode

first, GATK cannot work without the option -R ref.fa

0
Entering edit mode

second, this is not the cmd line you used as there is no such argument value 'HCL.txt' at position 4.

0
Entering edit mode

I actually copy/pasted it, but decided to change the names in favor of clarity. I tryed with the reference genome, and now I'm getting this error instead:

ERROR MESSAGE: Input files /home/hjorvik/Escritorio/TFM/Datos/1KG/ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf and reference have incompatible contigs. Please see https://software.broadinstitute.org/gatk/documentation/article?id=63for more information. Error details: No overlapping contigs found.


This is the cmd line I used, this time with its original names:

java -jar GenomeAnalysisTK.jar -R GRCh38_full_analysis_set_plus_decoy_hla.fa -T SelectVariants --variant ALL.chr6.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf -IDs HCL.txt --out chrm6_trimed.vcf

1
Entering edit mode
0
Entering edit mode

check #chrom column in your VCF and match with that from fasta file headers (chromosome/contig names)

0
Entering edit mode

I done this, but still have the same problem. I've realized that, even if I changed my chrom column, the contigs in the header are still labeled as 1,2,3,4. How can I change them using gawk?