Which genome files to use for STAR?
12 months ago
Nico80 • 0
University of Edinburgh, UK

I am trying to build a genome index for use with STAR, and I am a bit confused about which files I should use.

According to the STAR manual (§2.2.1):

It is strongly recommended to include major chromosomes (e.g., for human chr1-22,chrX,chrY,chrM,) as well as un-placed and un-localized scaffolds. Typically, un-placed/un-localized scaffolds add just a few MegaBases to the genome length, however, a substantial number of reads may map to ribosomal RNA (rRNA) repeats on these scaffolds. These reads would be reported as unmapped if the scaffolds are not included in the genome, or, even worse, may be aligned to wrong loci on the chromosomes. Generally, patches and alternative haplotypes should not be included in the genome.

I have downloaded the following:

wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{1..22}.fa.gz
wget ftp://ftp.ensembl.org/pub/release-96/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.{MT,X,Y}.fa.gz

I have not downloaded the masked genomes (_rm and _sm), but what about the following files?

Homo_sapiens.GRCh38.dna.nonchromosomal.fa.gz: are these the scaffolds the STAR manual is talking about? The README file on the Ensembl FTP site seems to imply that scaffolds are in seqlevel files, but I cannot see any.

Homo_sapiens.GRCh38.dna.toplevel.fa.gz: the README states this

contains all sequence regions flagged as toplevel in an Ensembl schema. This includes chromosomes, regions not assembled into chromosomes and N padded haplotype/patch regions.

So, according to the STAR manual I should not include this, is this correct?

Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz: this contains

all toplevel sequence regions excluding haplotypes and patches.

So could I just use this instead of the chromosome files above? Or should I use it in addition?

Just use Homo_sapiens.GRCh38.dna.primary_assembly.fa as the reference. It doesn't make sense to concatenate all the other files just to end up with the same file.
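
For anyone landing here later, a minimal sketch of what building the index from that single file might look like. The function name `build_star_index` is a hypothetical helper, not part of STAR. The `--sjdbOverhang 100` value is an assumption (the STAR manual suggests read length minus 1), and the sketch assumes the FASTA and a matching GTF have already been downloaded and decompressed, since STAR reads the genome FASTA as plain text.

```shell
# Sketch only: build a STAR index from one uncompressed FASTA plus a GTF.
# build_star_index is a hypothetical helper name, not a STAR command.
# Arguments: 1 = genome FASTA, 2 = annotation GTF, 3 = output dir, 4 = threads
build_star_index() {
    # STAR expects the genome directory to exist before it writes to it.
    mkdir -p "$3" || return 1
    # ASSUMPTION: --sjdbOverhang 100 suits your reads; adjust to read length - 1.
    STAR --runMode genomeGenerate \
         --runThreadN "${4:-4}" \
         --genomeDir "$3" \
         --genomeFastaFiles "$1" \
         --sjdbGTFfile "$2" \
         --sjdbOverhang 100
}
```

It could then be called along the lines of `build_star_index Homo_sapiens.GRCh38.dna.primary_assembly.fa annotation.gtf star_index 8`, where `annotation.gtf` stands in for whichever GTF matches your Ensembl release.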

Thank you, Benn. Just out of curiosity, could you confirm whether my understanding of the different files is correct?

I don't know the answers to all your questions about what is in each file. If you are interested, you can download them and look at the contents. The STAR manual tells us that Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz is an acceptable file to use, which is why I recommended it. Good luck with the mapping.
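
One rough way to look inside the downloaded files is to read only the FASTA header lines. The two function names below are hypothetical helpers, and the second one assumes Ensembl-style headers in which the second word describes the sequence type (for example `dna:chromosome` or `dna:scaffold`); files from other sources may be laid out differently.

```shell
# Sketch only: peek at what a gzipped FASTA contains by reading its headers.
# Both function names are hypothetical helpers, not part of any toolkit.

# Print one sequence name per line (the first word of each ">" header).
fasta_seq_names() {
    gzip -cd "$1" | awk '/^>/ { sub(/^>/, "", $1); print $1 }'
}

# Count sequences by the second header word.
# ASSUMPTION: Ensembl-style headers such as ">1 dna:chromosome ..." or
# ">KI270728.1 dna:scaffold ..."; other sources format headers differently.
fasta_seq_types() {
    gzip -cd "$1" \
        | awk '/^>/ { n[$2]++ } END { for (t in n) print n[t], t }' \
        | sort -k2
}
```

Running `fasta_seq_types` on the nonchromosomal, primary_assembly and toplevel files, or comparing sorted `fasta_seq_names` output with `diff`, should show how the files relate to each other without having to trust the README alone.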

You can get the reference genome here: https://data.broadinstitute.org/Trinity/CTAT_RESOURCE_LIB/

Source: https://github.com/STAR-Fusion/STAR-Fusion/wiki (go to "Data Resources Required")
