News: The New GRCh38 Human Genome Browser Has Arrived!
10.1 years ago

The new version of the UCSC browser has been released. I am posting it here on Biostar because some of the new features, such as the alternate reference loci and the analysis set sequences intended for NGS read alignment, may affect bioinformatics pipelines.

Here is the release note, from the UCSC mailing list:

In the final days of 2013, the Genome Reference Consortium (GRC) released the eagerly awaited GRCh38 human genome assembly, the first major revision of the human genome in more than four years. During the past two months, the UCSC team has been hard at work building a browser that will let our users explore the new assembly using their favorite Genome Browser features and tools. Today we're announcing the release of a preliminary browser on the GRCh38 assembly. Although we still have plenty of work ahead of us in constructing the rich feature set that our users have come to expect, this early release will allow you to take a peek at what's new.

Starting with this release, the UCSC Genome Browser version numbers for human assemblies will match those of the GRC to minimize version confusion. Hence, the GRCh38 assembly is referred to as hg38 in Genome Browser datasets and documentation. We've also made some slight changes to our chromosome naming scheme that affect primarily the names of haplotype chromosomes, unplaced contigs and unlocalized contigs. For more details about this, as well as information about the GRCh38 assembly files, statistics, and links for downloading the UCSC data files, see the Genome Browser hg38 gateway page.

What's new in GRCh38?

  • Alternate sequences - Several human chromosomal regions exhibit sufficient variability to prevent adequate representation by a single sequence. To address this, the GRCh38 assembly provides alternate sequence for selected variant regions through the inclusion of alternate loci scaffolds (or alt loci). Alt loci are separate accessioned sequences that are aligned to reference chromosomes. This assembly contains 261 alt loci, many of which are associated with the LRC/KIR area of chr19 and the MHC region on chr6.
  • Centromere representation - Debuting in this release, the large megabase-sized gaps that were previously used to represent centromeric regions in human assemblies have been replaced by sequences from centromere models created by Karen Miga et al. using centromere databases developed during her work in the Willard lab at Duke University and analysis software developed while working in the Kent lab at UCSC. The models, which provide the approximate repeat number and order for each centromere, will be useful for read mapping and variation studies.
  • Mitochondrial genome - The mitochondrial reference sequence included in the GRCh38 assembly and hg38 Genome Browser (termed "chrM" in the browser) is the Revised Cambridge Reference Sequence (rCRS) from MITOMAP with GenBank accession number J01415.2 and RefSeq accession number NC_012920.1. This differs from the chrM sequence (RefSeq accession number NC_001807) used by the previous hg19 Genome Browser, which was not updated when the GRCh37 assembly later transitioned to the new version.
  • Sequence updates - Several erroneous bases and misassembled regions in GRCh37 have been corrected in the GRCh38 assembly, and more than 100 gaps have been filled or reduced. Much of the data used to improve the reference sequence was obtained from other genome sequencing and analysis projects, such as the 1000 Genomes Project.
  • Analysis set - The GRCh38 assembly offers an "analysis set" that was created to accommodate next generation sequencing read alignment pipelines. Several GRCh38 regions have been eliminated from this set to improve read mapping. The analysis set may be downloaded from the Genome Browser downloads page.
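For pipelines that need to treat these sequence classes differently, here is a minimal shell sketch (not part of the UCSC announcement) that sorts hg38 sequence names into categories. The function name `classify_seq` is hypothetical, and the patterns (`_alt`, `_random`, `chrUn_`, `_decoy`) are assumptions based on how names commonly appear in hg38 FASTA files; check them against your own `.fai` index before relying on them.

```shell
# Classify a UCSC-style hg38 sequence name by its naming pattern.
# ASSUMPTION: the patterns below mirror common hg38 naming; they are
# illustrative, not an official specification.
classify_seq() {
    case "$1" in
        chrM)                                    echo "mitochondrial" ;;
        chrEBV|*_decoy)                          echo "decoy" ;;
        chrUn_*)                                 echo "unplaced" ;;
        *_random)                                echo "unlocalized" ;;
        *_alt)                                   echo "alt" ;;
        chr[1-9]|chr1[0-9]|chr2[0-2]|chrX|chrY)  echo "primary" ;;
        *)                                       echo "other" ;;
    esac
}
```

For example, `cut -f1 genome.fa.fai | while read -r n; do printf '%s\t%s\n' "$n" "$(classify_seq "$n")"; done` would list every sequence with its class, where `genome.fa.fai` is a placeholder for your own index.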

There's much more to come! This initial release of the hg38 Genome Browser provides a rudimentary set of annotations. Many of our annotations rely on data sets from external contributors (such as our popular SNPs tracks) or require massive computational effort (our comparative genomics tracks). In the upcoming months and years, we will release many more annotation tracks as they become available. To stay abreast of new datasets, join our genome-announce mailing list or follow us on Twitter.

We'd like to thank our GRC and NCBI collaborators who worked closely with us in producing the hg38 browser. Their quick responses and helpful feedback were a key factor in expediting this release. The production of the hg38 Genome Browser was a team effort, but in particular we'd like to acknowledge the engineering efforts of Hiram Clawson and Brian Raney, the QA work done by Steve Heitner, project guidance provided by Ann Zweig, Robert Kuhn, and Jim Kent, and documentation work by Donna Karolchik.

ucsc assembly GRCh38 hg38 • 26k views

For convenience, here is a post describing tools for converting coordinates between genome versions: Converting Genome Coordinates From One Genome Version To Another (Ucsc Liftover, Ncbi Remap, Ensembl Api)

10.1 years ago

I know what is going to happen next:

< image not found >

10.1 years ago
zx8754 11k

At least the naming convention is finally consistent:

< image not found >

9.5 years ago
Denise CS ★ 5.2k

There was a thread this week on the Ensembl dev mailing list giving more details on this.

This is a summary of Andy Yates' reply:

If Ensembl were to change from 'Y' to 'chrY', the release of GRCh38 (Ensembl release 76) would have been the perfect time to do it. However, the change would have applied to human only, leaving our remaining 70 or so species out and leading to inconsistencies in our database.

If we were to change, there could be a risk of having things such as chrchrY being created and causing bugs in users' pipelines.

The take-home message from that post is 'No, Ensembl will NOT change'. This does not mean the Ensembl team won't review the issue, but for the time being that's the case.

We will endeavour to 'provide an easy way to map between the namespaces...and to provide documentation about how to use Ensembl data in common pipelines'. For more info, just contact us here or elsewhere.

Andy's detailed reply can be viewed on the dev list archive.
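To illustrate the 'chrchrY' risk and what a mapping between the two namespaces could look like, here is a minimal shell sketch. It is not Ensembl's or UCSC's tooling: `to_ucsc` and `to_ensembl` are hypothetical helper names, and the sketch assumes that only the primary chromosomes (1-22, X, Y) and the mitochondrial sequence (MT vs. chrM) can be mapped by simple renaming. Scaffold names differ in more than the prefix and would need a lookup table, so they pass through unchanged.

```shell
# Convert an Ensembl-style primary chromosome name to UCSC style.
# Names that already carry the prefix fall through untouched, so calling
# the function twice never produces "chrchrY".
to_ucsc() {
    case "$1" in
        MT)                       echo "chrM" ;;
        [1-9]|1[0-9]|2[0-2]|X|Y)  echo "chr$1" ;;
        *)                        echo "$1" ;;   # unknown or already prefixed
    esac
}

# Convert a UCSC-style primary chromosome name to Ensembl style.
to_ensembl() {
    case "$1" in
        chrM)                                    echo "MT" ;;
        chr[1-9]|chr1[0-9]|chr2[0-2]|chrX|chrY)  echo "${1#chr}" ;;
        *)                                       echo "$1" ;;   # unknown or already unprefixed
    esac
}
```

Restricting the conversion to a known list of names, rather than blindly prepending or stripping "chr", is the design choice that keeps both functions safe to apply repeatedly.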


Upvote because of info update, not because of non-unification of the chromosome names across browsers.

10.1 years ago

Why does UCSC insist on having their own chromosome naming scheme? I don't know any bioinformatician who hasn't been frustrated by this.

Update: UCSC uses the same chrom names as the GENCODE GTF. According to Brad's notes below, NCBI's Refseq GFFs will soon have the same chrom names. Ensembl GTFs are the only holdout, and the rumor is that they will start using the same conventions. So fret not, bioinformaticians of the future!

Update #2: See Denise's comment above. According to this thread, UCSC's chr-prefixed chrom names will not be adopted by Ensembl. It will break too many things for their users, especially considering that their framework needs to support genomes other than just human. So we're back to square one, where you gotta choose a side when building your pipelines. I vote for the side with properly versioned biannual API+data releases.


NCBI, UCSC and Ensembl did a lot of work on GRCh38 to ensure the chromosome naming and ordering was consistent and there wouldn't be multiple versions of GRCh38 like hg19/GRCh37:

wget -O - ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/GCA_000001405.15_GRCh38_full_analysis_set.fna.gz | gunzip -c | head -1
>chr1  AC:CM000663.2  gi:568336023  LN:248956422  rl:Chromosome  M5:6aef897c3d6ff0c78aff06ac189178dd  AS:GRCh38

See the README on the analysis set distribution from NCBI for more details:

ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/README_ANALYSIS_SETS


By "did a lot of work", you must mean that they finally talked to each other. ;) But this is good news, if they are planning to sync up their chrom names.

Though, it is bizarre that they would standardize on the chr-prefix. "chrEBV" doesn't even make sense... unless you accept a loose definition of "chromosome".


So Ensembl is using the "chr" prefix?


That's the plan, based on conversations from just before the GRCh38 release. If things have changed I'm sure @Emily_Ensembl will chime in to tell everyone I'm totally wrong and should not be trusted. The goal was to try and unify naming and have one consistent set of coordinates and FASTA reference, so the NCBI analysis set also has GATK-ready ordering, the mitochondrial chromosome, PAR masking and all the other work people did post-GRCh37 to cause all the divisions in references. Fingers crossed.


So these discussions seem to have failed.

$ curl -s ftp://ftp.ensembl.org/pub/release-81/gtf/homo_sapiens/Homo_sapiens.GRCh38.81.gtf.gz|gzip -cd|head
#!genome-build GRCh38.p3
#!genome-version GRCh38
#!genome-date 2013-12
#!genome-build-accession NCBI:GCA_000001405.18
#!genebuild-last-updated 2015-06
1    havana    gene    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2";
1    havana    transcript    11869    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; tag "basic"; transcript_support_level "1";
1    havana    exon    11869    12227    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
1    havana    exon    12613    12721    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
1    havana    exon    13221    14409    .    +    .    gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; havana_gene "OTTHUMG00000000961"; havana_gene_version "2"; transcript_name "DDX11L1-002"; transcript_source "havana"; transcript_biotype "processed_transcript"; havana_transcript "OTTHUMT00000362751"; havana_transcript_version "1"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";

$ curl -s ftp://ftp.ensembl.org/pub/release-81/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.1.fa.gz|gzip -cd|head

>1 dna:chromosome chromosome:GRCh38:1:1:248956422:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

So it's back to square -5 for the bioinformaticians of the world.
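A manual check like the two above could be folded into a pipeline guard. The following is only a sketch: `naming_style` is a hypothetical helper that reads a FASTA stream on stdin, looks at the first header line, and reports whether the sequence name carries the "chr" prefix. The labels it prints are my own, not anything the providers define.

```shell
# Report the naming style of the first FASTA header read from stdin.
# Prints "prefixed" (e.g. >chr1), "unprefixed" (e.g. >1), or "unknown"
# if no header line is found. Labels are illustrative only.
naming_style() {
    awk '
        /^>/ {
            name = substr($1, 2)          # drop the leading ">"
            if (name ~ /^chr/) print "prefixed"; else print "unprefixed"
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}
```

Usage would look like `gzip -cd reference.fa.gz | naming_style`, where `reference.fa.gz` stands in for whichever reference you downloaded; a pipeline can then refuse to run when the reference and the annotation disagree.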


At the moment Ensembl can translate chr1 => 1 when you display a BAM or bigWig (bw) file as a custom track, so Ensembl is more flexible in this regard. I still hope the major browsers get some name unification going, especially for bacteria (I could not name a genome so that it would display in both http://microbes.ucsc.edu/ and Ensembl) and for other organisms whose chromosomes are named with Roman numerals.

8.4 years ago
brianpenghe ▴ 80

The official info is here.

To convert between hg19 and hg38, you could use the UCSC liftover tool.
