I am interested in using the TCGA SNP array data. What I would like to get is a list of the SNPs per sample with their chromosome, position, and reference and alternate alleles. This would be something like:
sample_id - chromosome - position - reference - alternate
Looking at the TCGA data portal, I have found a series of files called genotype.dat (one per sample) that contain the following information:
Composite_element_ref chromosome Physical position Genotype
SNP_A-8575115 2 533321 AB
*Fake data
I have assumed that the first column is some kind of ID, the second is the chromosome, and the third is the position. However, I am not sure about the meaning of the fourth column.
The values that appear in it are AA, AB, BB, or NC. Does this mean homozygous, heterozygous, and not computed? How could I map these SNPs to the actual nucleotides that are being changed (for example C -> T)?
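To make the goal concrete, here is a rough sketch of the mapping I have in mind. It rests on several assumptions of mine: that genotype.dat is tab-separated with a single header line, that A and B stand for two alleles of each probe, and that some lookup from probe ID to those two alleles is available. The `PROBE_ALLELES` table and the function names are hypothetical, and the C/T values are made up, just like the example row above.

```python
import csv
import io

# Hypothetical lookup: probe ID -> (allele A, allele B). Values are made up.
PROBE_ALLELES = {"SNP_A-8575115": ("C", "T")}


def call_to_alleles(call, allele_a, allele_b):
    """Translate an AA/AB/BB call into a pair of nucleotides.

    Returns None for NC or any other unrecognised value.
    """
    mapping = {
        "AA": (allele_a, allele_a),
        "AB": (allele_a, allele_b),
        "BB": (allele_b, allele_b),
    }
    return mapping.get(call)


def parse_genotypes(handle, sample_id, probe_alleles):
    """Yield (sample_id, chromosome, position, call, nucleotides) per row.

    Assumes a tab-separated file with one header line and four columns.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header line
    for row in reader:
        if len(row) != 4:
            continue  # ignore blank or malformed lines
        probe_id, chromosome, position, call = row
        alleles = probe_alleles.get(probe_id)
        nucleotides = call_to_alleles(call, *alleles) if alleles else None
        yield (sample_id, chromosome, int(position), call, nucleotides)


# The fake row from above, written out with tabs.
example = (
    "Composite_element_ref\tchromosome\tPhysical position\tGenotype\n"
    "SNP_A-8575115\t2\t533321\tAB\n"
)
rows = list(parse_genotypes(io.StringIO(example), "sample_1", PROBE_ALLELES))
```

Even if this is roughly right, it would only give me the two alleles carried by each sample. I would still need to know which of A or B is the reference allele to fill in the reference and alternate columns.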
Thanks a lot in advance,
Joan