Question

Relating Chromosome Position With Id Number

0

12.3 years ago

Sofia • 0

Hi, I have two data sets containing mutations: one is from the 1000 Genomes Project and the other is my own data. Below is an example row from each data set:

my_data (1 row and 13 columns):

3BHS2_HUMAN_A10E      P26439    10      rs28934880   A    E      C    A  probably_damaging      alignment       neutral         0.328         0.145

1000G (1 row and 8 columns):

20    59132666    .    G    A    56.74    PASS    AB=0.53;AC=6;AF=0.0102;AN=586;BaseQRankSum=3.459;BaseQRankSumZ=-0.123;DP=1115;Dels=0.00;HRun=1;HaplotypeScore=0.1646;MQ=95.36;MQ0=0;MQRankSum=1.213;MQRankSumZ=0.694;QD=4.19;ReadPosRankSum=1.963;ReadPosRankSumZ=0.349;SB=-0.49;VQSLOD=4.0533;set=ALL119

I'm writing a Python script in which, for each mutation (line) in my_data, I would like to find the matching mutation (line) in 1000G. The information above is all I have. My question is: how can I relate the information in my_data to the information from 1000G? What I want is the chromosome position, or at least to know that I'm looking at the same mutation (if it exists) in both files. Is this possible?

Best,

Sofia

chromosome id position • 3.0k views

updated 8.4 years ago by Biostar 20 • written 12.3 years ago by Sofia • 0

0

It would help to see the column headers. For example, what is "10" in your column 3? The chromosome?

12.3 years ago by Neilfws 49k

score 1 · Answer 1 · 2012-01-03

1

12.3 years ago

Larry_Parnell 16k

I would either relate by chromosome and position (being absolutely certain that the genome builds are identical for both datasets) or by assigning rs id#s to the 1000G data (then relate by rs id).

0

That would work if she had such information in her "my_data"; since she does not, she needs an intermediate annotation step in order to uniquely relate entries between the datasets.

12.3 years ago by Jorge Amigo 14k

0

She does have rs IDs for her my_data entries and I thought as you do that there'd be an intermediate assignment step by getting rs IDs for the 1000G entries she's using. LD could be another issue here, but it is not clear where she's going once the data entries are related to one another.

12.3 years ago by Larry_Parnell 16k

score 1 · Answer 2 · 2012-01-04

You can look up the positional information by querying dbSNP or Ensembl BioMart with the rs IDs. As has been noted, it is very important to use the correct genome build to get the coordinates right, and I am not sure which genome build your data relies on. Your 1000G file looks like it is in VCF format, so it was likely downloaded from here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/release/2010_07/

To identify the correct genome build, look at the header of the VCF file: does it contain a ##reference tag? In a random VCF file from the pilot, it reads:

##reference=human_b36_both.fasta

That would tip me off to the fact that NCBI build 36 was used rather than the latest build, 37.
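The header check described above can be scripted. This is only a minimal sketch: the function names `vcf_reference` and `vcf_reference_from_path` are made up for illustration, and it assumes the file is plain text or gzip-compressed and that the `##reference` line, if present, sits among the `##` meta-information lines at the top of the file.

```python
import gzip


def vcf_reference(lines):
    """Return the value of the first ##reference header line, or None."""
    for line in lines:
        if not line.startswith("##"):
            break  # meta-information lines stop at the #CHROM column header
        if line.startswith("##reference="):
            return line.rstrip("\n").split("=", 1)[1]
    return None


def vcf_reference_from_path(path):
    """Same check for a file on disk (hypothetical helper; handles .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return vcf_reference(handle)


# Toy header using the ##reference line quoted in this answer.
header = [
    "##fileformat=VCFv4.0\n",
    "##reference=human_b36_both.fasta\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
]
print(vcf_reference(header))
```

If the function returns None, the build has to be determined some other way, for example from the documentation that accompanies the release.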

Now you have two options: either query the latest dbSNP version (currently 135) with the rs IDs and then translate the VCF coordinates between genome builds, or query the last dbSNP version that was based on build 36, which is dbSNP build 130.

I would go for the first option and use, e.g., LiftOver to translate the coordinates in your file to the latest genome build; for that, you will most likely need to convert the VCF file to BED format. Using BEDtools, you can then intersect the coordinates.
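As a rough sketch of the VCF-to-BED step: VCF positions are 1-based, while BED intervals are 0-based and half-open, so a variant at POS covers BED start POS-1 and end POS-1+len(REF). The function names below are hypothetical, and adding a UCSC-style "chr" prefix is an assumption about the chain file you use with LiftOver; drop it if your files use bare chromosome names.

```python
def vcf_line_to_bed(line, add_chr_prefix=True):
    """Convert one VCF data line into a 4-column BED line."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt = fields[:5]
    if add_chr_prefix and not chrom.startswith("chr"):
        chrom = "chr" + chrom  # assumption: UCSC-style chromosome names
    start = int(pos) - 1       # VCF is 1-based, BED is 0-based
    end = start + len(ref)     # half-open interval covering the REF allele
    # Keep the original VCF position and alleles in the name column so the
    # record can still be traced back after the coordinates are translated.
    name = variant_id if variant_id != "." else f"{chrom}:{pos}:{ref}>{alt}"
    return f"{chrom}\t{start}\t{end}\t{name}"


def vcf_to_bed(lines):
    """Yield BED lines for every data line, skipping headers and blanks."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        yield vcf_line_to_bed(line)


# The example row from the question (INFO field shortened).
row = "20\t59132666\t.\tG\tA\t56.74\tPASS\tAB=0.53;AC=6\n"
for bed_line in vcf_to_bed(["#CHROM\tPOS\tID\tREF\tALT\n", row]):
    print(bed_line)
```

The resulting BED file is what you would pass to LiftOver, and afterwards to BEDtools for the intersection.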

score 0 · Answer 3 · 2012-01-04

If you want to match two datasets, you need common fields that uniquely identify each entry in both. I don't see any field that would do that in the data you describe above, so you need to do some extra work on one dataset or the other:

on your "my_data" side, by retrieving the chromosome and position that correspond to each reported rs code and then crossmatching them with the chromosome and position reported on the 1000 Genomes side (this is what I would do);
on the 1000 Genomes side, by looking in another table that reports the rs code of each variant, or by retrieving the rs code for each chromosome position listed.
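The first option could look roughly like the sketch below. It assumes you have already built a lookup from rs code to (chromosome, position), for example from a dbSNP or BioMart export on the same genome build as the VCF; that lookup, the function names, and the toy coordinates are all hypothetical and are not real annotations. It also assumes the rs code is the fourth whitespace-separated column of my_data, as in the example row.

```python
def index_vcf_by_position(vcf_lines):
    """Map (chromosome, position) to the full VCF data line."""
    index = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        index[(fields[0], int(fields[1]))] = line.rstrip("\n")
    return index


def crossmatch(my_data_lines, rs_to_position, vcf_index, rs_column=3):
    """Yield (rs code, position or None, matching VCF line or None)."""
    for line in my_data_lines:
        fields = line.split()
        if len(fields) <= rs_column:
            continue  # skip blank or truncated lines
        rs_id = fields[rs_column]
        position = rs_to_position.get(rs_id)
        match = vcf_index.get(position) if position is not None else None
        yield rs_id, position, match


# Toy inputs. The coordinates are placeholders, NOT the real location of
# rs28934880; the second my_data row is invented for illustration.
my_data = [
    "3BHS2_HUMAN_A10E P26439 10 rs28934880 A E C A probably_damaging alignment neutral 0.328 0.145",
    "TOY_HUMAN_X1Y Q00000 1 rs0000000 X Y G T benign alignment neutral 0.1 0.1",
]
rs_to_position = {"rs28934880": ("1", 100)}
vcf = ["#CHROM\tPOS\tID\tREF\tALT\n", "1\t100\t.\tC\tA\t50\tPASS\tAC=1\n"]

for rs_id, position, match in crossmatch(my_data, rs_to_position, index_vcf_by_position(vcf)):
    print(rs_id, position, "found" if match else "not found")
```

A position match alone does not prove that both files describe the same mutation; comparing the REF and ALT alleles of the VCF line with the nucleotides in my_data would be a sensible extra check.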