Question

Forum: When will hg19/GRCh37 finally become obsolete?

1

4.9 years ago

gdaly9000 ▴ 10

I have recently run into the issue of finding relatively recent databases and papers that align to the old reference assembly (hg19 / GRCh37).

I understand that it is non-trivial to change alignments, but this assembly was last patched in 2013. For novel projects, especially, it should be trivial to use the latest assembly.

Can someone please help me understand why new projects are using the older assembly?

Reference-Genome • 1.7k views

Updated 10 months ago by Ram 43k • written 4.9 years ago by gdaly9000 ▴ 10

0

GRCh37, IMO, is the equivalent of IBM mainframe computers, especially with clinical informatics. It is a sequence that works well enough to not be a pain to most research folks. IIRC, even NCBI/dbSNP only made the switch relatively recently. I think GRCh37 is going to be around for a few more years at least. It will take major tools deprecating support for GRCh37 for the forced switch process to start.

4.9 years ago by Ram 43k

score 2 · Answer 1 · 2019-05-31

I wouldn't hold your breath.

Hit counts per year from PubMed Central full-text searches, e.g. https://www.ncbi.nlm.nih.gov/pmc/?term=((%222015%22%5BElectronic+Publication+Date%5D+%3A+%222015%22%5BElectronic+Publication+Date%5D))+AND+grch37%5BText+Word%5D

year    hg38     hg19    grch37   grch38
2019     279     1050       511      383
2018     728     3354      1550      997
2017     567     3730      1590      764
2016     299     3413      1441      465
2015     105     2977      1224      228

Some reproducible code to plot cumulative mentions:

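library(ggplot2)
library(lubridate)
library(tidyr)
library(dplyr)

# freezes.txt: tab-separated copy of the table above, with columns
# year, hg38, hg19, grch37, grch38
freezes <- read.table("freezes.txt", sep = "\t", header = TRUE)

freeze_plotable <- freezes %>%
  # long format: one row per (year, term)
  gather(term, mentions, -year) %>%
  group_by(term) %>%
  arrange(year) %>%
  # running total of mentions for each term
  mutate(cmentions = cumsum(mentions)) %>%
  # place each total at the end of its year; 2019 is a partial year, so its
  # point goes at the current day of the year (only meaningful if run in 2019)
  mutate(year = ifelse(year == 2019, 2019 + (yday(Sys.Date()) / 365), year + 1)) %>%
  ungroup() %>%
  # fix the legend order
  mutate(term = factor(term, levels = c("hg19", "grch37", "grch38", "hg38")))

ggplot(freeze_plotable, aes(year, cmentions)) + geom_smooth(aes(color = term))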

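As a rough cross-check on the same counts, here is a minimal base-R sketch that pools the two names for each assembly and computes what fraction of mentions refer to the newer one in each year. The column name `new_share` and the pooling itself are my own choices for illustration. A paper that mentions several terms is counted once per term, and a mention is not the same as actual use, so treat the result as indicative only.

```r
# Counts copied from the table above (PubMed Central full-text hits per term).
freezes <- data.frame(
  year   = c(2015, 2016, 2017, 2018, 2019),
  hg38   = c(105, 299, 567, 728, 279),
  hg19   = c(2977, 3413, 3730, 3354, 1050),
  grch37 = c(1224, 1441, 1590, 1550, 511),
  grch38 = c(228, 465, 764, 997, 383)
)

# Pool the UCSC and GRC names for each assembly.
old_build <- freezes$hg19 + freezes$grch37
new_build <- freezes$hg38 + freezes$grch38

# Fraction of all mentions that refer to the newer assembly (hypothetical column name).
freezes$new_share <- new_build / (old_build + new_build)

print(data.frame(year = freezes$year, new_share = round(freezes$new_share, 3)))
```

On these numbers the newer assembly goes from roughly 7% of mentions in 2015 to roughly 30% in the partial 2019 data, so hg19/GRCh37 still account for most mentions in every year listed.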

score 1 · Answer 2 · 2019-05-31

1

4.9 years ago

predeus ★ 1.9k

I'll go out on a limb here and say that there is no clear benefit to switching from hg19 to hg38 for most purposes.

So hg19 will become obsolete at the same time as hg38: when using a graph-based genome reference becomes feasible.

score 0 · Answer 3 · 2019-05-31

I think some databases are only available for hg19. You could lift them over, but I would guess some results will differ between a direct alignment and a liftOver annotation. For example, some variant frequencies are likely to be affected by the genome build and/or pre-processing steps.

I think having multiple genome options is good (and, if you are looking at certain genes, like HLA genes, using a custom reference may be preferable to the main genome build).

However, if you are mostly comparing to previous samples (say, 90% of your data is already aligned to hg19), I think many analyses will be OK if you add new samples to the previous alignment. If you encounter a strange result, though, it may be worth checking whether updating the reference (or other steps) changes it; if nothing else, that could give you confidence that the issue is less likely to be due to the genome reference.

I think the genome build is a bigger issue for other organisms with fewer genome revisions. Some programs like i-cisTarget provide analysis for the older mouse build (mm9) as well (and I think there are some situations where that program can be useful). However, I'm having a hard time thinking of an example of a resource where the reference sequence / curation is at an even earlier stage.