Question

Hg19 reference genome for bedtools getfasta

0

Entering edit mode

5.2 years ago

bioinfo456 ▴ 150

Where can I download the required reference genome from? I noticed it is available in .2bit format in UCSC. And when I downloaded it from the NCBI I'm getting the following error while using bedtools getfasta. Please help.

WARNING. chromosome (22) was not found in the FASTA file. Skipping.

assembly genome • 2.5k views

ADD COMMENT • link 5.2 years ago by bioinfo456 ▴ 150

1

Entering edit mode

Does your BED file happen to have 22 instead of chr22, so no "chr" prefix?

ADD REPLY • link 5.2 years ago by ATpoint 81k

0

Entering edit mode

Even after prefixing, I'm having the same error. I believe it has something to do the the reference genome. Do you reckon? Also I noticed that the index file generated for the reference genome is very different, here's a snippet :

NC_000001.10    249250621   69  80  81
NT_113878.1 106433  252366418   80  81
NT_167207.1 547496  252474277   80  81
NC_000002.11    243199373   253028686   80  81
NC_000003.11    198022430   499268121   80  81
NC_000004.11    191154276   699765901   80  81
NT_113885.1 189789  893309701   80  81
NT_113888.1 191469  893501958   80  81
NC_000005.9 180915260   893695889   80  81
NC_000006.11    171115067   1076872659  80  81
NC_000007.13    159138663   1250126734  80  81
NT_113901.1 182896  1411254726  80  81
NC_000008.10    146364022   1411439978  80  81
NT_113909.1 38914   1559633646  80  81
NT_113907.1 37175   1559673142  80  81

ADD REPLY • link 5.2 years ago by bioinfo456 ▴ 150

1

Entering edit mode

Your chromosomes have the NCBI contig identifier, not chromosome names. All the files you use across a pipeline need to follow the same naming convention for chromosomes. Software is dumb, it has no idea that NC_00001.10 and chr1 mean the same thing in your current context.

ADD REPLY • link 5.2 years ago by Ram 43k

1

Entering edit mode

Obviously, the identifiers are not standard chromosome names, neither something like 22 or chr22 but the contig names from NCBI. This non-sense nomenclature (at least in terms of standard usage as reference genome) is part of why I at all costs try to avoid anything that comes from RefSeq-related pages. I would get the genome from the UCSC:

rsync -avzP rsync://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/ .

and then use cat to combine all chromosomes to one genome file, followed by indexing and running bedtools. Note that UCSC has a nomenclature with chr prefix so you'll have to add this to your BED file.

ADD REPLY • link 5.2 years ago by ATpoint 81k

0

Entering edit mode

Alright thanks for the clarification both of you. Regarding indexing, it's going to happen automatically when I run bedtools getfasta, right?

ADD REPLY • link 5.2 years ago by bioinfo456 ▴ 150

0

Entering edit mode

Try running it or reading the documentation. If an index is created, well and good. If not, create one. Please do not ask for such easily accessible information before trying your best.

ADD REPLY • link 5.2 years ago by Ram 43k

1

Entering edit mode

I'm more of a fan of the ensembl genome versions. In my view, they have better and more thorough annotation and a more well documented release schedule. Just my 2 cents.

ADD REPLY • link 5.2 years ago by colindaven 6.3k