Biostar Beta. Not for public use.
what sql schema to store a vcf file ?
1
Entering edit mode
13 months ago
sacha ♦ 1.7k
France

I would like to know what kind of sql schema do you propose to store a vcf file ? Is there a common way to proceed ? I imagine a table "variant" which contains : chr start ref alt and a second table "sample" with a n-n relation . But How to store genotype ? How to store info field ?

ADD COMMENTlink
2
Entering edit mode

I imagine a table "variant" which contains : chr start ref alt and a second table

not so easy, there is more than one ALT per variant... ALT can be a symbolic allele, a very large string, etc...

I wrote a vcf2sql : https://github.com/lindenb/jvarkit/wiki/VCF2SQL but in the end, it was useless. I found it easier to only store the path to the tabix-indexed VCF files

see also : Vcf And Mongodb

ADD REPLYlink
1
Entering edit mode

Is MongoDB a better alternative than postgreSQL ? What's your opinion ? my goal is to perform some set operation like : listVariantA - ( ListVariantB | ListVariantC) . I m not sure Nosql database is able to do this job faster.

ADD REPLYlink
1
Entering edit mode

listVariantA - ( ListVariantB | ListVariantC)

GATK SelectVariants with "concordance" or "discordance" https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php

ADD REPLYlink
0
Entering edit mode

I know.. ! But Thanks!

ADD REPLYlink
0
Entering edit mode

What about duplicating lines ? A > T,C become : A > T A > C

ADD REPLYlink
0
Entering edit mode

what about het variants ? storing genotypes like "1/2" == "T/C"

ADD REPLYlink
0
Entering edit mode

Dear Pierre,

Kindly help me with this question, Filtering multisample VCF based on genotype using SnpSift filter

ADD REPLYlink
0
Entering edit mode

I've seen this asked many times and have never seen a good, compelling answer. Which makes me suspect that many people think SQL is for some reason, not a good solution and they prefer to use tools which work with the flat file.

ADD REPLYlink
3
Entering edit mode
10 months ago
London

I'd suggest you look at the Gemini database schema (readthedocs seems to be down at the moment, but I'm pretty sure that's the right page)

ADD COMMENTlink
0
Entering edit mode

Thanks ! I already look on this ! I get some ideas !

ADD REPLYlink
0
Entering edit mode

Thanks, Andrew and Pierre for your links.

Hi Sacha, MongoDB vs MySQL which is good enough to build VCF database for multiple samples?

ADD REPLYlink
0
Entering edit mode

that's a good question. MongoDB is really powerful to store Data as Document. It should be faster than sql to display data if you query has low complexity.
But I guess Sql is a better option if you need to perform complex query like Intersection . The main problem with the VCF file is optional data as key-value. It's easy to implement with mongodb because it a key:value database. But sql database needs fixed tables. For my project I use PostgreSQL witch support natively JSON type and Array Type .I store my key-value data into an Array type field. { { key, value}, {key, value}}

ADD REPLYlink
0
Entering edit mode
13 months ago
sacha ♦ 1.7k
France

Attach there is my sql schema to store vcf. It's looks pretty similar than variant_tools database

Briefly : VCF is file . This file has samples. Each sample is attached to a variant . The variant database contains chr,pos,ref,alt as a unique key. info field in "sample_has_variant" contains optional values( Postgre Array type).
genotype field contains : 0 = homozygotes ref , 1 = heterozygote 2 = homozygote alt -1 = composite

Sql database schema

ADD COMMENTlink
0
Entering edit mode

Thanks, Sacha.

In the case of multiple sample VCF, do you vcf-subset each sample into a single sample VCF file?

Can we upload multisample VCF into a MySQL table by changing the number of columns dynamically?

ADD REPLYlink
0
Entering edit mode

I suggest you to check variant_tools software. They load multisample vcf file in an sqlite file. It works pretty .

ADD REPLYlink

Login before adding your answer.

Similar Posts
Loading Similar Posts
Powered by the version 2.1