Question

Assigning SNPs to gene names based on genomic coordinates

1

Entering edit mode

5.4 years ago

paolo002 ▴ 160

Hi all, I am not sure if this was asked before but I would like to assign a list of SNPs to the corresponding gene names:

I have a df such as the following:

           SNP std.XPEHH CHR       BP     rank
1  rs376121383   5.86403   2 97868880       1
2    rs9324211   5.86122   2 97867870       2
3   rs12445490   5.85442  16  5570974       3
4  rs201474380   5.84314   2 97868313       4
5  rs190916690   5.79274   2 97868544       5
6  rs112920594   5.79176   2 97866321       6
7  rs372154564   5.79176   2 97866355       7
8   rs13409957   5.78445   2 97865894       8
9   2:97868257   5.78096   2 97868257       9
10 rs116237095   5.78096   2 97868240      10
11 rs181446057   5.77994   2 97868213      11
12 rs370914866   5.77985   2 97867072      12
13  2:97865943   5.77978   2 97865943      13
14 rs115004312   5.76058   2 97866683      14
15 rs192731212   5.76058   2 97866908      15
16 rs371563138   5.75741   2 97867703      16
17 rs111558429   5.75590   2 97867824      17
18   rs9636497   5.73166   2 97868926      18
19   rs3926003   5.71338   2 97870247      19
20  rs12327919   5.70304   2 97866928      20
21 rs116820074   5.64604  16  5570936      21
22  16:5570973   5.62932  16  5570973      22
23 rs115661023   5.61264   2 97865044      23
24  rs55879702   5.60655   2 97866370      24
25 rs377749588   5.60236   2 97865217      25
26 rs148207230   5.59137  16  5564330      26
27  rs10169898   5.57194   2 61103311      27
28   rs9636495   5.56266   2 97869056      28
29   rs8053291   5.53916  16  5566143      29
30  rs10168001   5.53839   2 97867470      30
....

BP is the coordinate of the corresponding SNP in base pair on the genome. Some of the SNPs are not annotated, so the ID is like number 9 in the df, with the chromosome number and base pair location. I would like to know in which gene the BP number is falling and so I can assign SNPs to their gene. I would like to do this in R and the gene name should be accordingly to hg19 UCSC and not ESEMBL, it would be better. Any help highly appreciated, thanks a lot Cheers

R • 2.7k views

ADD COMMENT • link updated 2.2 years ago by zx8754 11k • written 5.4 years ago by paolo002 ▴ 160

2

Entering edit mode

This has been asked before many times, search for "map SNP to gene":

ADD REPLY • link 5.4 years ago by zx8754 11k

0

Entering edit mode

ok thanks a lot for all the links. I am taking a look.

ADD REPLY • link 5.4 years ago by paolo002 ▴ 160

1

Entering edit mode

5.4 years ago

Emily 23k

The Ensembl VEP will tell you what genes they map to and the effects they have.

ADD COMMENT • link 5.4 years ago by Emily 23k

0

Entering edit mode

4th link covers this already, why answer? Maybe update/edit your existing answer from that link?

ADD REPLY • link 5.4 years ago by zx8754 11k

zx8754 · Accepted Answer · 2018-12-06

Hi, I have found a nice solution for doing that in R, I post it so maybe it might be useful to other people:

Let' say you have a data frame (df) with list of SNPs ID (rs IDs and not annotated IDs) with chromosome number and location in Base Pairs (BP) on the genome ( I show the first 10 rows but the df continues with other many rows):

           SNP std.XPEHH  CHR    BP      rank
1  rs376121383   6.48482 chr2 97868880         1
2  rs201474380   6.47901 chr2 97868313         2
3    rs3926003   6.46181 chr2 97870247         3
4  rs112920594   6.42510 chr2 97866321         4
5  rs372154564   6.42510 chr2 97866355         5
6  rs190916690   6.42453 chr2 97868544         6
7  rs370914866   6.41392 chr2 97867072         7
8   2:97865943   6.40923 chr2 97865943         8
9   2:97868257   6.40907 chr2 97868257         9
10 rs116237095   6.40907 chr2 97868240        10

In order to get the following output data frame:

   ENTREZID      rank         SNP std.XPEHH CHR    BP   GENE NAME
1     11066         1 rs376121383   6.48482 chr2 97868880 SNRNP35
2     11066         2 rs201474380   6.47901 chr2 97868313 SNRNP35
3     11066         3   rs3926003   6.46181 chr2 97870247 SNRNP35
4     11066         5 rs372154564   6.42510 chr2 97866355 SNRNP35
5     11066         4 rs112920594   6.42510 chr2 97866321 SNRNP35
6     11066         6 rs190916690   6.42453 chr2 97868544 SNRNP35
7     11066         7 rs370914866   6.41392 chr2 97867072 SNRNP35
8     11066         8  2:97865943   6.40923 chr2 97865943 SNRNP35
9     11066         9  2:97868257   6.40907 chr2 97868257 SNRNP35
10    11066        10 rs116237095   6.40907 chr2 97868240 SNRNP35

It is possible to add the corresponding gene name to the SNP location by converting the df to GRanges object and use the findOverlaps function to check which of the SNPs coordinates are intersecting in which genes (based on their coordinates in the TxDb.Hsapiens.UCSC.hg19.knownGene database as GRanges object):

#To find overlapping SNPs coordinates to gene coordinates  Load following libraries
library(GenomicRanges)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
head(df)
# add rank column to the df (this will be usuful later on when fining overlapping rows, already added in the previous df)
df <- df %>% mutate(rank = 1:nrow(df))
# add "chr" before chromosome number in the data frame df in the CHR column (already added in the previous df)
df$CHR <- sprintf('chr%s', df$CHR)
# convert the data frame to GRanges object,
# it takes the chromosome number and the SNP coordinate on Base Pair (BP)
SNPs.gr <- GRanges( seqnames=df$CHR,
                    IRanges(start=df$BP,
                            width=1))
headSNPs.gr)    
# add the info of the reference genome
genomeSNPs.gr) <- 'hg19'
# get the coordinates of all the genes from database
txs <- transcriptsBy(TxDb.Hsapiens.UCSC.hg19.knownGene)
head(txs)
# find which coordinates of the SNP list are folling into the coordinates of the gene list in the data base
df_overlap<-findOverlaps(df,txs)
#modify name of first column of the previous output to merge with original df
names(df_overlap)[1]<-"rank"
df_merged<-merge(df, df_overlap, by="rank", all=T)
head(df_merged)

# To convert to ENTREZ IDs to HUGO gene names:
library(EnsDb.Hsapiens.v86)
hsens=EnsDb.Hsapiens.v86
my.symbols <- as.character(df_merged$subjectHits)
df_HUGO<-select(hsens,  
       keys = my.symbols, 
       columns = c("ENTREZID", "SYMBOL"), 
       keytype = "ENTREZID")
names(df_merged)[6]<-"ENTREZID"
df_final<-merge(df_merged, df_HUGO, by="ENTREZID", all=T)