Question

How to generate an index (unique_key) for each genotype in a VCF File?

0

Entering edit mode

9.9 years ago

Raony Guimarães ★ 1.4k

I'm trying to define a new way to create a "unique_key" for each genotype in a vcf file to use as an index into an SQL database.

For example for this two lines:

#CHROM POS     ID        REF    ALT     QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4

I would get a list of unique indexes like this:

20-14370-G-G
20-14370-A-G
20-14370-A-A
20-1110696-G-T
20-1110696-T-T

Obs: For 1|2 and 2|1 I would get the same id 20-1110696-G-T

My primary reason for this would be to quickly compare and filter some genotypes from two tables, for example:

table1 has 40k rows and table2 has 100 million rows

If I had a "unique id" like that I could quickly eliminate genotypes from table1 that would be present in table2.

Does it make any sense to create a unique id for each genotype or am I trying to reinvent the wheel somehow ?

Is there any other more correct way of achieving this goal of generating a unique id for each genotype present in the VCF?

PS: Yes I know how to intersect positions from a BED or a VCF file, but this time I need to find a way to generate a unique key for each genotype in order to do this comparison quickly using an SQL query.

Thank you for your attention! :)

index unique_key vcf unique_id • 3.5k views

ADD COMMENT • link updated 2.5 years ago by Ram 43k • written 9.9 years ago by Raony Guimarães ★ 1.4k

Ram · Answer 1 · 2014-06-12

0

Entering edit mode

9.9 years ago

Pierre Lindenbaum 161k

for a unique ID use a hash like MD5: http://dev.mysql.com/doc/refman/5.5/en/encryption-functions.html#function_md5

select md5("20-14370-G-G");
+----------------------------------+
| 1155a58368ab7d38210d21216e84659d |
+----------------------------------+

or create a table for chrom_position and link a table genotype to it

also, have a look at gemini

I also implemented a vcf to SQL, but I don't use it at all: https://github.com/lindenb/jvarkit/wiki/VCF2SQL

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thanks a lot Pierre! Once I come up with this unique id, I could use md5 for generating a hash with it. That's a good idea!

My concern was more about generating this unique_id by translating the genotypes in the last column of the VCF (Ex. 0/0, 0/1 and 1/1) using the columns REF and ALT (Ex. G and A) to something like this G-G, G-A and A-A and combining this with the 20-14370 part.

Linking two tables, one with chr-position and another with the genotypes, its all I want to avoid because of performance issues :) This task will have to be repeated a lot of times!

About GEMINI, i checked their table and I didn't see anything like that in here http://gemini.readthedocs.org/en/latest/content/database_schema.html#the-variants-table

For me it's strange no one ever defined a unique id to express exactly the chr-pos-genotype of an individual. The only way I could imagine how to do this would be by combining some columns of the VCF.

I'm assuming there is no simple way to do this, so my algorithm will be like this:

1) Get columns CHROM (1), POS (2), REF (4), ALT (5) and columns with genotypes

2) Generate a list with REF and ALT columns Ex. [A,G,T]

3) Translate 0/0, 0/1, 1/1 to A-A, A-G, A-T

Here i have to define a way to sort the genotypes cause i want 0/1 == 1/0 == 0|1 == 1|0 in order to generate the same unique id for this genotypes. So G-A would become A-G and T-A would become A-T, just to get the same "md5" for both of them.

4) Integrate the previous string with the chr-position part (20-14370) to FINALLY get my UNIQUE ID for each genotype:

20-14370-G-A

Does it sound too complicated ? :)

Cheers!

ADD REPLY • link 9.9 years ago by Raony Guimarães ★ 1.4k

0

Entering edit mode

For me it's strange no one ever defined a unique id to express exactly the chr-pos-genotype of an individual.

I think people work with indexed VCF and tools like vcf-tools or GATK, people don't need indexes on genotype.

one with chr-position and another with the genotypes, its all I want to avoid because of performance issues: how many genotypes do you have?

you could also use a nosql solution like berkeleydb, using chrom-pos as the key, and an array of A1/A2 as the value.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 9.9 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

You are right, I guess im the only one trying to generate indexes on genotypes ... :)

I have used mongodb and I did exactly what you said chr-pos for the key and genotypes in the array.
What happened was like that, I would query the database to get the records with the same key "chr-pos" in both tables and them compare the genotypes in the array programatically. I guess nowadays everyone compares the genotypes in this way.

But men, this takes a lot longer than using this unique key, specially when you have a lot of genotypes to compare. I guess I will implement this and report back the results, there is no better way to prove my point :)

Thanks a lot again!

ADD REPLY • link 9.9 years ago by Raony Guimarães ★ 1.4k