Checking chromosome builds for genotyping data
2.6 years ago

Hi,

I have several studies' worth of data (in both PLINK and VCF format), and I was wondering if anyone knows of an online tool I could use to check the chromosome build, i.e. GRCh37 vs GRCh38. (I thought I had used one before, but I have since been unable to find it.)

Or, if there are any other suggestions for how best to check this, I'd be glad to hear them. I guess I could try merging the study data with the reference builds and seeing which aligns best, but that would probably take a while to work through.

Thanks for any help,

Mari

GWAS Liftover PLINK
2.2 years ago
LauferVA 4.2k

While this should be known to the analyst going in, for many reasons it often isn't. There are more and less rigorous ways of determining this, but the simplest I can think of is to pick one marker per chromosome and check which build matches the chr:position in each case; after only a few markers you'll know the answer.
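For example, a minimal sketch of that check in Python, assuming the genotypes are in PLINK .bim format; the file name G.bim and the entries of KNOWN_POSITIONS are placeholders to fill in (positions from dbSNP or a genome browser for markers actually present in your data):

import pandas as pd

# Placeholder lookup table: fill in 1-based positions from dbSNP / UCSC for a
# handful of markers (ideally one per chromosome) that you know are on your array.
KNOWN_POSITIONS = {
    # "rs7412": {"GRCh37": <hg19 bp>, "GRCh38": <hg38 bp>},
}

bim = pd.read_csv("G.bim", sep=r"\s+", header=None,
                  names=["CHR", "SNP", "Distance", "BP", "A1", "A2"])

hits = {"GRCh37": 0, "GRCh38": 0}
for rsid, builds in KNOWN_POSITIONS.items():
    row = bim.loc[bim["SNP"] == rsid]
    if row.empty:
        continue
    bp = int(row["BP"].iloc[0])
    for build, pos in builds.items():
        if bp == pos:
            hits[build] += 1

print(hits)  # the build collecting (nearly) all of the matches is the one you have

With one known marker per chromosome, a handful of matches is enough to settle the question.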

14 days ago
Muhammad • 0

Assume you have genotype data in PLINK format (G.bed, G.bim, G.fam). Download a reference panel in PLINK format (R.bed, R.bim, R.fam) in a known build (hg19 or hg38).

Then generate a second copy of the reference panel, lifted over to the other build, e.g. with pyliftover:

from pyliftover import LiftOver
import pandas as pd

# ref38.bim: the reference panel .bim file in GRCh38
df = pd.read_csv("ref38.bim", sep=r"\s+", header=None,
                 names=['CHR', 'SNP', 'Distance', 'BP', 'A1', 'A2'])
print(df.head())

# Build the chain converter once (hg38 -> hg19)
lo = LiftOver('hg38', 'hg19')

def lift_position(chrom, bp):
    # .bim positions are 1-based; pyliftover expects 0-based coordinates,
    # so subtract 1 before converting and add 1 back afterwards.
    # Returns None if the position does not map to the target build.
    hit = lo.convert_coordinate("chr" + str(chrom), int(bp) - 1)
    return hit[0][1] + 1 if hit else None

df['BP_hg19'] = [lift_position(c, b) for c, b in zip(df['CHR'], df['BP'])]

# Drop variants that did not lift over, then write the hg19 copy of the .bim
df = df.dropna(subset=['BP_hg19'])
df['BP'] = df['BP_hg19'].astype(int)
df = df.drop(columns=['BP_hg19'])

print(df.head())
df.to_csv("ref19.bim", sep="\t", index=False, header=False)

Now you have the reference panel .bim file in both builds: ref38.bim (GRCh38, as downloaded) and ref19.bim (GRCh37, produced above). Only the .bim files are needed for the position comparison.

Count the number of variants shared between G.bim and ref19.bim (C19), and between G.bim and ref38.bim (C38), matching on a CHR_BP_A1_A2 key:

bimfile["match"] = bimfile[0].astype(str)+"_"+bimfile[3].astype(str)+"_"+bimfile[4].astype(str)+"_"+bimfile[5].astype(str)
df["match"] = df["CHR"].astype(str)+"_"+df["BP"].astype(str)+"_"+df["A1"].astype(str)+"_"+df["A2"].astype(str)
df.drop_duplicates(subset='match', inplace=True)
bimfile.drop_duplicates(subset='match', inplace=True)
hg19variants = len(df[df['match'].isin(bimfile['match'].values)])
print(hg19variants)

df = pd.read_csv(GWAS, compression="gzip",sep="\s+",on_bad_lines='skip')    
bimfile = pd.read_csv("genotypes.bim38",sep="\s+",header=None)
bimfile["match"] = bimfile[0].astype(str)+"_"+bimfile[3].astype(str)+"_"+bimfile[4].astype(str)+"_"+bimfile[5].astype(str)
df["match"] = df["CHR"].astype(str)+"_"+df["BP"].astype(str)+"_"+df["A1"].astype(str)+"_"+df["A2"].astype(str)
df.drop_duplicates(subset='match', inplace=True)
bimfile.drop_duplicates(subset='match', inplace=True)
hg38variants = len(df[df['match'].isin(bimfile['match'].values)])
print(hg38variants)

if hg19variants > hg38variants:
    return "19"
if hg38variants > hg19variants:
    return "38"

Match key: CHR_BP_A1_A2

If C19 > C38:
  G is hg19
else:
  G is hg38
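
A usage sketch for the comparison above (file names are assumptions; substitute your own paths):

# Assumed file names: G.bim is the genotype data; ref19.bim / ref38.bim were
# produced in the liftover step above.
build = detect_build("G.bim", ref19_bim="ref19.bim", ref38_bim="ref38.bim")
print("Genotype data appear to be on build", build)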
