Why is the human genome FASTA file on GENCODE much smaller than the one on Ensembl?
3.0 years ago
Xiaokang ▴ 70

I have some RNA sequencing reads to align to the human reference genome. I found genome FASTA files on both GENCODE and Ensembl: GRCh38.p13.genome.fa.gz and Homo_sapiens.GRCh38.dna.toplevel.fa.gz, respectively. After unzipping them, I found that they are 3.1G and 60G. Why is that? And which one should I use, given that the purpose of the project is to detect gene fusions from the sequencing reads?

reference genome GENCODE ENSEMBL • 2.9k views
3.0 years ago
GenoMax 141k

The toplevel file from Ensembl includes haplotypes padded out to the full length of the chromosome with N's. That is why it is huge compared to the GENCODE file. Use the Ensembl primary file, which is equivalent to GENCODE.

From the README at Ensembl:

---------
TOPLEVEL
---------
These files contains all sequence regions flagged as toplevel in an Ensembl
schema. This includes chromsomes, regions not assembled into chromosomes and
N padded haplotype/patch regions.
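
A rough way to confirm that N padding accounts for the size difference is to count the N bases in each file. The following is only a minimal sketch: the function names `base_counts` and `base_counts_from_file` are made up for this illustration, and the toy record is not real data.

```python
import gzip


def base_counts(lines):
    """Return (total_bases, n_bases) for FASTA text given as an iterable of lines."""
    total = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        total += len(line)
        n_bases += line.count("N") + line.count("n")
    return total, n_bases


def base_counts_from_file(path):
    """Summarise a plain or gzipped FASTA file without loading it into memory."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return base_counts(handle)


# Toy lines standing in for an N-padded record (hypothetical, not real data).
toy = [">padded_example", "NNNNNNNNNN", "ACGTACGTAC", "NNNNNNNNNN"]
total, n_bases = base_counts(toy)
print(f"{n_bases} of {total} bases are N")
```

Running `base_counts_from_file` on the toplevel and primary files should show whether most of the extra size is N's.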

Haha, fun truth. But then I'm wondering: can I use the same GTF annotation file with both the toplevel and primary FASTA files? Or rather, will the N's confuse the coordinates in the GTF?


Do not use the toplevel file unless you have a specific reason to do so, i.e. you need the haplotypes.
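
If in doubt about whether a GTF fits a given FASTA, one option is to check it directly rather than guess. The sketch below assumes a standard 9-column, tab-separated GTF and tests only two things: that every sequence name in the GTF exists in the FASTA, and that no feature ends beyond its sequence. The helper names `fasta_lengths` and `gtf_problems` are hypothetical, and the toy inputs are invented.

```python
def fasta_lengths(lines):
    """Map each sequence name (first word of the header) to its length."""
    lengths = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            lengths[name] = 0
        elif name is not None:
            lengths[name] += len(line)
    return lengths


def gtf_problems(gtf_lines, lengths):
    """Return (seqname, reason) for GTF records that do not fit the FASTA."""
    problems = []
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comment and blank lines
        fields = line.rstrip("\n").split("\t")
        seqname, end = fields[0], int(fields[4])  # column 5 is the feature end
        if seqname not in lengths:
            problems.append((seqname, "missing from FASTA"))
        elif end > lengths[seqname]:
            problems.append((seqname, "feature end beyond sequence length"))
    return problems


# Invented toy data: one 12-base sequence and three GTF records.
toy_fasta = [">1 toy sequence", "ACGTACGT", "ACGT"]
toy_gtf = [
    "#!toy header",
    '1\ttoy\tgene\t1\t10\t.\t+\t.\tgene_id "g1";',
    '1\ttoy\tgene\t5\t20\t.\t+\t.\tgene_id "g2";',
    '2\ttoy\tgene\t1\t5\t.\t+\t.\tgene_id "g3";',
]
for seqname, reason in gtf_problems(toy_gtf, fasta_lengths(toy_fasta)):
    print(seqname, reason)
```

An empty result only means these two checks passed; it does not prove that the annotation and the FASTA come from the same release.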


I tried comparing the 3 files in question and found that there are 639 sequences in both the GENCODE genome and the Ensembl toplevel, but only 194 sequences in the Ensembl primary.
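
For reference, a sequence count like this can be reproduced by counting header lines. This is a minimal sketch, not necessarily how the numbers above were obtained; `sequence_names` is a made-up helper and the toy headers are invented.

```python
def sequence_names(lines):
    """Return the first word of every FASTA header line, in file order."""
    names = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


# Invented toy input with three records.
toy = [">seqA some description", "ACGT", ">seqB", "NNNN", "ACGT", ">seqC x y", "AC"]
names = sequence_names(toy)
print(len(names), "sequences:", ", ".join(names))
```

To apply this to a real file, pass an open text handle, for example `gzip.open(path, "rt")` for a `.fa.gz` file. Comparing `set(names)` between two files shows which records are present in only one of them, provided both files use the same naming convention.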


Now that we have settled on the primary FASTA, I have a further question. There are 3 FASTA files flagged as primary: Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz (unmasked genomic DNA sequences), Homo_sapiens.GRCh38.dna_rm.primary_assembly.fa.gz (masked genomic DNA), and Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz (soft-masked genomic DNA). As this article says, we should avoid using rm. But how do I choose between the other two for (short/long-read) RNA-seq alignment?


Use the primary unmasked genome. See: Masking reference for RNA-seq alignments

  • 'dna_rm' - masked genomic DNA. Interspersed repeats and low complexity regions are detected with the RepeatMasker tool and masked by replacing repeats with 'N's.
  • 'dna_sm' - soft-masked genomic DNA. All repeats and low complexity regions have been replaced with lowercased versions of their nucleic base.
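
To make the difference between the two masking styles concrete, here is a small sketch on an invented sequence. `mask_summary` is a hypothetical helper. It shows that soft-masking keeps the bases (upper-casing recovers the unmasked sequence), whereas hard-masking replaces them with N's and the original bases cannot be recovered.

```python
def mask_summary(seq):
    """Count lowercase (soft-masked) letters and N bases of either case."""
    soft_masked = sum(1 for base in seq if base.islower())
    n_bases = seq.upper().count("N")
    return {"length": len(seq), "soft_masked": soft_masked, "n_bases": n_bases}


# Invented toy sequence: a repeat flanked by unique bases.
repeat = "TTTTTTTT"
unmasked = "ACGT" + repeat + "ACGT"
soft = "ACGT" + repeat.lower() + "ACGT"       # 'dna_sm' style
hard = "ACGT" + "N" * len(repeat) + "ACGT"    # 'dna_rm' style

print(mask_summary(soft))
print(mask_summary(hard))
print("soft-masked recoverable:", soft.upper() == unmasked)
print("hard-masked recoverable:", hard.upper() == unmasked)
```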