Question

Is it 100% correct to transform SNP data (categorical) to (0, 1, 2) format to apply ML algorithms later? Why not binary (0, 1) data?

0

7.8 years ago

mgvaldesgraterol ▴ 10

I wanted to know why it is correct to transform SNP data to a 0, 1, 2 format using a reference allele in order to later apply machine learning algorithms to predict a specific trait. For example, for SNP1 with C/T alleles the transformation rules would be: CC = 2, CT = 1, TT = 0.

I ask because assigning these ordinal values to SNP data may greatly affect the result of a classification model: in a way, we are giving "more importance" to the diploid genotype CC, with the larger value 2, than to the diploid genotype TT, with the value 0.

Wouldn't it be better and more correct to transform the data into a binary format, where each SNP feature is transformed into 4 binary features: SNP1_CC, SNP1_CT, SNP1_TC, SNP1_TT? With this scheme, the sample:

ID  SNP1  SNP2
1   CC    AG

Will be transformed to:

ID  SNP1_CC  SNP1_CT  SNP1_TC  SNP1_TT  SNP2_GG  SNP2_GA  SNP2_AG  SNP2_AA
1   1        0        0        0        0        0        1        0
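The two encodings under discussion can be sketched side by side. This is a minimal plain-Python illustration, not a recommended pipeline. The function names `additive_encode` and `one_hot_encode` and the `snp_alleles` table are made up for this example, and it assumes unphased biallelic genotypes, so CT and TC are treated as the same heterozygous genotype (three indicator columns per SNP instead of four).

```python
def additive_encode(genotype, counted_allele):
    """Count copies of `counted_allele` in a two-letter genotype (0, 1 or 2)."""
    return genotype.count(counted_allele)


def one_hot_encode(genotype, alleles):
    """Return one 0/1 indicator per unordered genotype of a biallelic SNP.

    Assumes unphased data, so 'CT' and 'TC' map to the same column.
    """
    a, b = alleles
    categories = [a + a, a + b, b + b]              # e.g. CC, CT, TT
    # Put the genotype's letters in the same order as `alleles`.
    key = "".join(sorted(genotype, key=alleles.index))
    return {cat: int(key == cat) for cat in categories}


sample = {"SNP1": "CC", "SNP2": "AG"}
# Hypothetical choice of which allele is counted (listed first) for each SNP.
snp_alleles = {"SNP1": "CT", "SNP2": "GA"}

for snp, genotype in sample.items():
    alleles = snp_alleles[snp]
    print(snp,
          "additive:", additive_encode(genotype, alleles[0]),
          "one-hot:", one_hot_encode(genotype, alleles))
```

The additive form keeps one numeric column per SNP, while the indicator form makes no assumption about how the three genotypes are ordered.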

SNP snp • 4.3k views

ADD COMMENT • link updated 7.8 years ago by Giovanni M Dall'Olio 28k • written 7.8 years ago by mgvaldesgraterol ▴ 10

1

I don't think that transforming it to categorical 0, 1, 2 necessarily means it is ranked 0 < 1 < 2. You could just as well transform it to categorical "donkey" (homozygous reference), "pig" (homozygous variant) and "chicken" (heterozygous). It's a label.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

I understand what you say, that it is just a re-labeling, but what I mean is transforming categorical data to numerical data so that I can apply ML methods that use numeric data and do not support categorical data. Is it still correct?

ADD REPLY • link 7.8 years ago by mgvaldesgraterol ▴ 10

2

Absolutely not. A 2 is not double the effect of a 1 for a simple dominant trait. And only zero is affected in a simple recessive trait. A numeric-only ML algo will absolutely screw this up.
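To make the point concrete, the sketch below recodes the same allele count under three textbook genetic models. It is only an illustration: it assumes the count refers to copies of the allele that influences the trait, and the function name `recode` is hypothetical.

```python
def recode(allele_count, model):
    """Map a 0/1/2 allele count to a numeric feature under a genetic model.

    Assumes `allele_count` is the number of copies of the allele of interest.
    """
    if model == "additive":     # effect grows with each copy
        return allele_count
    if model == "dominant":     # one copy is enough
        return int(allele_count >= 1)
    if model == "recessive":    # both copies are needed
        return int(allele_count == 2)
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    print(model, [recode(count, model) for count in (0, 1, 2)])
```

Only the additive coding treats 2 as twice the effect of 1. Under the dominant and recessive codings the heterozygote is grouped with one of the homozygotes, which a numeric-only method fed raw 0/1/2 values cannot know in advance.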

ADD REPLY • link 7.8 years ago by karl.stamm 4.1k

0

I'm confused, sorry... So you are saying that (0, 1, 2) as numeric data is an incorrect input for an ML algorithm, and that a (0, 1) encoding would be more appropriate?

ADD REPLY • link 7.8 years ago by mgvaldesgraterol ▴ 10

0

No. Numeric input (0, 1, 2) is incorrect. Categorical input (0, 1, 2) is fine.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

0

I already told you that I understand that categorical input (0, 1, 2) is OK, because this would be just relabeling the data. But this is not what I'm asking. I'm asking what kind of numerical transformation of categorical SNP data is better for later applying ML algorithms that accept only numeric data as input.

ADD REPLY • link 7.8 years ago by mgvaldesgraterol ▴ 10

1

There is no appropriate transformation. You could argue that homozygous for the most prevalent allele is the least likely to be harmful and could be encoded as 0/neutral. But as John wrote, there are examples of heterozygous advantages compared to both homozygous types. So there are no good general rules. I like the idea of applying ML to variant data, but you should know that most likely most variants are harmless or have minimal effect... and just add noise.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

1

Every genetic haplotype has the potential to result in a totally different phenotype. It might be that 0/0 is bad, 0/1 makes you healthier, and 1/1 gives you sickle-cell anaemia.

Moreover, haplotypes in isolation might not make much sense either. 1/1 of allele A might cause cancer, but 1/1 of A and 1/1 of allele B might cancel each other out, and in the process protect you from other cancers. Bottom line: some assumptions and simplifications of the real problem will have to be made in your model, and it's more important that you respect the assumptions made than that you pick the "best" assumption and then pretend your model is the best possible model without any limitations. What I'm saying is, choosing to turn categorical data into continuous data will sensitise your model to diseases that work that way, and that might be a good thing. Choosing a model where every genotype is its own independent observation might sensitise your model to more complex-trait diseases, and miss more obvious ones.
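A toy calculation shows how the coding choice decides what a model can see. The phenotype values below are invented purely for illustration (a heterozygote that differs from both homozygotes), and `fit_line` is a hypothetical helper doing an ordinary least-squares fit on the allele count.

```python
# Toy phenotype values per allele count; invented for illustration only.
counts = [0, 1, 2]
phenotype = [0.0, 1.0, 0.0]     # heterozygote differs from both homozygotes


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(counts, phenotype)
linear_pred = [intercept + slope * x for x in counts]

# Treating each genotype as its own category amounts to one mean per genotype.
per_genotype_mean = dict(zip(counts, phenotype))
categorical_pred = [per_genotype_mean[x] for x in counts]

print("slope on allele count:", slope)
print("linear predictions:", linear_pred)
print("categorical predictions:", categorical_pred)
```

In this constructed case the fitted slope on the allele count is zero, so the continuous coding reports no effect at all, while the per-genotype coding reproduces the pattern. With a strictly additive toy phenotype the picture would be reversed in favour of the simpler continuous coding.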

ADD REPLY • link 7.8 years ago by John 13k

score 0 · Answer 1 · 2016-07-15

0

7.8 years ago

Giovanni M Dall'Olio 28k

If I understand correctly, in this case each variant would be encoded by 4 binary numbers.

However, the problem is that these 4 numbers would be related to each other (if one is 1, the others must necessarily be 0), meaning that you can't use them as independent observations. This is usually bad for any machine learning or regression method.
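The dependence between the indicator columns is easy to see in a small sketch. The code below assumes a hypothetical C/T SNP with three unordered genotypes; `one_hot` and `dummy` are illustrative names. Dropping one reference column is a common workaround in regression settings, shown here as one option rather than the required fix.

```python
GENOTYPES = ["CC", "CT", "TT"]      # unordered genotypes of a hypothetical C/T SNP


def one_hot(genotype):
    """Full indicator coding: one column per genotype."""
    return [int(genotype == g) for g in GENOTYPES]


def dummy(genotype, reference="CC"):
    """Indicator coding with the reference genotype's column dropped."""
    return [int(genotype == g) for g in GENOTYPES if g != reference]


samples = ["CC", "CT", "TT", "CT"]
for genotype in samples:
    full = one_hot(genotype)
    # Every full row sums to 1, so any one column is determined by the others.
    print(genotype, "full:", full, "row sum:", sum(full),
          "reduced:", dummy(genotype))
```

In the reduced coding the reference genotype is represented by all zeros, so no information is lost and the columns no longer sum to a constant.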

One alternative may be to use only haplotypes instead of genotypes, e.g. have one string for every copy of a chromosome. This will only work if the data are phased, and there are no triallelic SNPs.

ADD COMMENT • link 7.8 years ago by Giovanni M Dall'Olio 28k

0

Independent observations? These are variables/features...

My dataset looks something like this, with only biallelic SNPs:

ID  SNP1  SNP2  SNP3  SNP4  ...
1   GA    CC    GG    TT
2   AA    AC    TG    AT
...

So actually I can have 16 possible combinations of ACGT alleles. So my resulting dataset would be transformed so that each SNP will be represented by 16 columns with 0/1 values.

Is this correct? I'm sorry if I'm not understanding; I'm a computer scientist, new to the bioinformatics world and very new to all this biology terminology.
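One way to check how many indicator columns a SNP would actually need is to count its possible genotypes once allele order is ignored. The sketch below assumes unphased data (so GA and AG are the same genotype) and reuses the two sample rows shown above; the helper names `unordered` and `possible_genotypes` are hypothetical.

```python
from itertools import combinations_with_replacement


def unordered(genotype):
    """Normalise a two-letter genotype so that 'GA' and 'AG' compare equal."""
    return "".join(sorted(genotype))


def possible_genotypes(alleles):
    """All unordered genotypes that can be formed from the given alleles."""
    return ["".join(pair)
            for pair in combinations_with_replacement(sorted(alleles), 2)]


rows = {
    1: {"SNP1": "GA", "SNP2": "CC", "SNP3": "GG", "SNP4": "TT"},
    2: {"SNP1": "AA", "SNP2": "AC", "SNP3": "TG", "SNP4": "AT"},
}

for snp in ["SNP1", "SNP2", "SNP3", "SNP4"]:
    observed = {unordered(row[snp]) for row in rows.values()}
    alleles = set("".join(observed))        # alleles seen at this SNP
    print(snp, "alleles:", sorted(alleles),
          "categories:", possible_genotypes(alleles))
```

Sixteen is the number of ordered pairs over A, C, G and T. A single biallelic SNP only involves two of those letters, which leaves three unordered genotypes per SNP under these assumptions.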

ADD REPLY • link 7.8 years ago by mgvaldesgraterol ▴ 10

0

In GWAS, for example, it does not matter whether you use GG - AG - AA or any other combination, since you are just checking the difference in minor allele frequency or genotypes between cases and controls. SNPs can indeed have those 16 combinations, but for GWAS etc. that is not really relevant; this information is more relevant for follow-up and for identifying the actual causal effect. Biallelic SNPs have one reference allele and one alternative allele, and for initial analyses it does not matter which combination (GA CC GG TT AA AC TG AT) or which format (0, 1 or 2) you use.

ADD REPLY • link 7.8 years ago by Floris Brenk ★ 1.0k

0

I understand what you are saying, but the fact is that I'm not performing GWAS, I want to apply machine learning algorithms to SNP data of this kind to predict/classify different traits.

ADD REPLY • link 7.8 years ago by mgvaldesgraterol ▴ 10