I received an hg38 VCF file whose variants were imputed with 1000 Genomes. I've run into several issues with it: REF alleles that do not match the reference genome, ALT alleles that do not appear to be reported anywhere in the literature, and, most recently, variants that flat-out do not map to the human genome (variants on chr19 at positions above 100 million bp, when the whole chromosome is in the 50 million bp range).
I've worked out hack-y solutions to most of the issues I've encountered, but this latest one has me stuck. I only detected these variants when I ran VEP and it flagged them as not mapping to the genome. As a result, I'm more or less removing these variants one at a time using grep -v. I'd like a solution where I can just remove any variants from the VCF that map to positions that do not exist in the human genome. Bonus points if the solution also covers some of the other issues I mentioned, although I think I've already found solutions to those. Is there anything out there that does this?
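For reference, the kind of filter being asked for can be sketched in a few lines of Python. This is only a minimal sketch, not an existing tool: it assumes an uncompressed, tab-delimited VCF and a FASTA index (.fai, as produced by samtools faidx) for the same hg38 reference, and the function names and the script name in the usage comment are made up for illustration. It drops any record whose contig is missing from the index or whose REF allele would extend past the end of the contig. It does not check whether the REF allele matches the reference sequence.

```python
import sys


def read_contig_lengths(fai_lines):
    """Parse .fai lines (name<TAB>length<TAB>...) into {contig: length}."""
    lengths = {}
    for line in fai_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            lengths[fields[0]] = int(fields[1])
    return lengths


def filter_vcf_lines(vcf_lines, lengths):
    """Yield header lines and records that fit inside a known contig."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        fields = line.split("\t", 4)  # CHROM, POS, ID, REF, rest
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        end = pos + len(ref) - 1  # last reference base covered by REF
        # Assumption: contig names in the VCF and the .fai use the same
        # style (e.g. "chr19" in both); unknown contigs are dropped.
        if chrom in lengths and pos >= 1 and end <= lengths[chrom]:
            yield line


# Hypothetical usage: python filter_oob.py genome.fa.fai < in.vcf > out.vcf
if __name__ == "__main__" and len(sys.argv) == 2:
    with open(sys.argv[1]) as fai:
        contig_lengths = read_contig_lengths(fai)
    sys.stdout.writelines(filter_vcf_lines(sys.stdin, contig_lengths))
```

Reading the lengths from the index rather than hard-coding them keeps the filter tied to whichever reference build the VCF is supposed to match.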
Hello john.michel.rouhana!
It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8629/remove-variants-that-do-not-map-to-human-genome
This is typically not recommended as it uses the finite time of volunteers in both communities.
I wasn't aware of that; thank you for pointing it out. I thought it would make the most sense to post it in both locations. Thanks for the etiquette lesson. Is there any way to remove my post here?
You have already received an answer, which is why we would restore a deleted post anyway, out of respect for the user who invested time in answering. Don't worry, leave the question here, but in the future please avoid cross-posting: many users are active in both communities, and this avoids duplicated effort ;-)
The least you could do is link both posts to each other, so contributors on forum A will find that someone has replied on forum B.