Question

Should and can we use the new draft pangenome reference?

3

11 months ago

Chris ▴ 260

Hi all, I followed the paper https://www.nature.com/articles/s41586-023-05896-x#data-availability to https://anvilproject.org/data/consortia/HPRC/workspaces, but I didn't see the fastq file of the new reference genome. Can and should we use the new draft pangenome reference to replace GRCh38? If so, a link to that fastq file would be really appreciated. Thank you so much for your advice.

reference-genome • 1.7k views

ADD COMMENT • link 10 months ago by Chris ▴ 260

score 4 · Answer 1 · 2023-06-28

Chris - thanks for a very timely question.

N.B.: I will frame this answer not so much with respect to the HPRC as its predecessor, the T2T consortium. Generally, both the strengths and weaknesses I mention below for T2T-CHM13v2 are amplified for HPRC at present: it is even newer and has even fewer tools and annotations, but in the abstract it is even better than T2T.

At present, I'd argue that whether or not you use the pangenome assembly (or CHM13 v2) depends on the specific use case(s) that motivated generation of the datasets to be assembled. This is because of a key trade-off: generally, T2T and HPRC assemblies both increase the power to detect certain genomic elements and reduce false positives; i.e., they are "better" for base-level genomics.

However, these theoretical advantages have to be considered against practical barriers that still exist at present. I'll try to provide a more granular view, below:

1) The use of gapless assemblies has already been shown to improve performance on the core genomic tasks of assembly, alignment, variant ascertainment, etc., as presented in a nice T2T companion paper, Aganezov et al. It's worth reading through this in toto, but as stated in the abstract:

We identify hundreds of thousands of variants per sample in previously unresolved regions, showcasing the promise of the T2T-CHM13 reference for evolutionary and biomedical discovery. Simultaneously, this reference eliminates tens of thousands of spurious variants per sample, including reduction of false positives in 269 medically relevant genes by up to a factor of 12. Because of these improvements in variant discovery coupled with population and functional genomic resources, T2T-CHM13 is positioned to replace GRCh38 as the prevailing reference for human genetics.

While this was written about T2T-CHM13, similar things are true of the pangenome reference assembly, with the added bonus that, if you are studying individuals from ancestral populations that have not had much sequencing performed to date, these gains might be amplified a bit. The abstract mentions false positive reduction in medically relevant genes by "up to a factor of 12". Part of knowing whether it is "worth it" to move to the new assemblies is knowing where these regions are. You can find annotations for the "new" sequences (those that were gaps, flagged, or misrepresented in prior builds) by reading the documentation for GRCh38, through resources like the CHM13 unique track, and by reading the core T2T/HPRC papers (e.g. Nurk et al. 2022). If you realize you are studying genes in such regions, switching might be relatively more important for you. This is in effect what motivated Evan Eichler to start recommending, more than 20 years ago, that a CHM be used as a genomic resource (he is interested in autism).
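To make the "are my genes in these regions?" check concrete, here is a minimal sketch of an interval-overlap test. It assumes you have already exported the newly resolved regions and your genes of interest as BED-style (0-based, half-open) intervals on the same assembly; the gene names and coordinates below are made up purely for illustration, and `genes_in_new_regions` is a hypothetical helper, not part of any T2T/HPRC tooling.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def genes_in_new_regions(genes, regions):
    """Return names of genes overlapping any region on the same chromosome.

    genes:   iterable of (name, chrom, start, end)
    regions: iterable of (chrom, start, end)
    """
    # Group regions by chromosome so each gene is only compared to its own chromosome.
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for name, chrom, start, end in genes:
        if any(overlaps(start, end, s, e) for s, e in by_chrom.get(chrom, [])):
            hits.append(name)
    return hits


# Illustrative (made-up) coordinates, not real annotations.
regions = [("chr1", 100, 200), ("chr2", 500, 900)]
genes = [
    ("GENE_A", "chr1", 150, 400),  # overlaps chr1:100-200
    ("GENE_B", "chr1", 200, 300),  # starts exactly where the region ends: no overlap
    ("GENE_C", "chr2", 0, 501),    # overlaps chr2:500-900 by one base
]
print(genes_in_new_regions(genes, regions))  # ['GENE_A', 'GENE_C']
```

For genome-scale files a dedicated interval tool will be much faster than this linear scan, but the logic is the same. The important caveat is that both sets of coordinates must refer to the same assembly.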

2) Tradeoff: different levels of annotation

There are some very practical reasons not to switch yet that relate to the amount/variety of data resources and companion sets annotated for T2T/HPRC contra GRCh38. To get a sense for this, scroll through the annotations presented on the CHM13 github page: https://github.com/marbl/CHM13. As you can see, key resources like ClinVar have been lifted over, and these efforts are impressive. Nevertheless, there are FAR more annotations for the GRCh builds at present. You can see the same thing visually by going to the UCSC genome browser URL for each build: compare the number of tracks available at the bottom of the page for GRCh38 (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38) with that for T2T CHM13v2 (https://genome.ucsc.edu/cgi-bin/hgTracks?db=hub_3671779_hs1).

3) Structural variant detection, epistasis, haplotyping, and 3rd gen (nanopore, SMRT) sequencing data

Generally, if you are making a serious study of SVs, long(er) range epistasis, or haplotype effects, I think it is worth at least exploring the newer assemblies, in particular if you have 3rd gen sequencing data. This is because ascertainment of these phenomena is particularly improved; for instance, we now know we were missing a large percentage of (especially large) inversion variants. While these gains are predicated upon the use of newer technologies, the choice of build also plays a role. So, for questions relating to the above, I might actually use both and draw on the strengths of one or the other depending on the goal.

Summary:

In summary, I'd recommend carefully considering what the goals of your study are. If you are likely to care more about whether a region you are interested in has abundant epigenomic / enhancer annotations, GRCh38 is probably going to prove easier. If you are studying a rare disease you think could be driven by an SV or haplotype effects, I'd recommend both 3rd gen sequencing and use of T2T/HPRC resources.

Data Availability: You also asked about data availability. Both T2T and HPRC are open consortia. To get the most up-to-date data (and you are going to want this if you are seriously considering using HPRC), I'd join the consortia and stay apprised of new and upcoming data releases. The number of genomes in the HPRC is steadily increasing, while major databases are likely to lag behind slightly.

score 3 · Answer 2 · 2023-05-18

3

11 months ago

Emily 23k

It's a reference assembly, which would suggest that it's FASTA not FASTQ. Ensembl have file dumps here.
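If you are unsure which of the two formats a downloaded file is, the first character of the first record is usually enough to tell: FASTA headers start with `>`, while FASTQ records start with `@`. A minimal sketch (the function name `sniff_sequence_format` is just illustrative, and this is a quick heuristic, not a validator):

```python
import io


def sniff_sequence_format(handle):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a text handle."""
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("@"):
            return "fastq"
        return "unknown"
    return "unknown"  # empty input


# Tiny in-memory examples; for a real file, pass open(path) or gzip.open(path, "rt").
fasta_example = io.StringIO(">chr1 example\nACGTACGT\n")
fastq_example = io.StringIO("@read1\nACGT\n+\nIIII\n")
print(sniff_sequence_format(fasta_example))  # fasta
print(sniff_sequence_format(fastq_example))  # fastq
```

A reference assembly has no per-base quality scores, which is why it ships as FASTA; FASTQ is what you get for sequencing reads.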

ADD COMMENT • link 11 months ago by Emily 23k

0

Thanks Emily :)

ADD REPLY • link 11 months ago by Chris ▴ 260

0

Hi Emily. I see there are many assemblies, so if I know my sample comes from a population such as African Caribbean in Barbados, should I download the equivalent reference genome from the list?

ADD REPLY • link 11 months ago by Chris ▴ 260

1

depending what your goals are, you'd likely actually want to download many genomes to assist with assembly of a person having such ancestry.

genetic admixture of persons living in the caribbean can reflect one, two, or several ancestral populations; e.g. west african (yoruban), western european (e.g., spanish), as well as sequences from persons indigenous to the area. even this latter is fairly complicated, reflecting island and location-specific effects that reflect historical events...

thus, to ensure that you have the most similar haplotypes available, i would download at minimum all the haplotypes for each of these candidate populations...

but if you are doing that, better just to download all of them - again, this maximizes your chances of having good comparator haplotypes at each locus.

ADD REPLY • link 10 months ago by LauferVA 4.2k