hg19 exome reference and index
6.2 years ago

What is the best source for obtaining the hg19 exome reference and its index? This reference will be used with BWA. Do I need to build the index myself if I'm using a specific version of BWA? Thank you!

hg19 exome reference index • 5.7k views
6.2 years ago

You should always align DNA-seq data to the entire genome. For hg19, download the hg19.2bit file from here: http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips/

Then, convert it to FASTA format with twoBitToFa: http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips/
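The two steps above can be sketched as a small shell helper. This is only an illustration: the function names are made up here, it assumes wget and twoBitToFa are already on your PATH, and it uses the same download server linked above.

```shell
#!/bin/sh
# Sketch only: download hg19.2bit and convert it to FASTA.
# Assumes wget and UCSC's twoBitToFa are installed and on PATH.
# The function names below are hypothetical, chosen for this example.

BIGZIPS_URL="http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/bigZips"

# Print the download URL for a file in the hg19 bigZips directory.
twobit_url() {
    printf '%s/%s\n' "$BIGZIPS_URL" "$1"
}

# Download hg19.2bit (unless a non-empty copy already exists),
# then convert it to hg19.fa in the current directory.
fetch_hg19_fasta() {
    if [ ! -s hg19.2bit ]; then
        wget -O hg19.2bit "$(twobit_url hg19.2bit)" || return 1
    fi
    twoBitToFa hg19.2bit hg19.fa
}
```

Calling fetch_hg19_fasta in an empty working directory should leave hg19.2bit and hg19.fa side by side. Nothing is downloaded until the function is called.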

Be aware, also, that GRCh38 / hg38 is the latest release of the human genome reference. hg19 has 'issues' (as does hg38...); see the Biostars answer "Alternate nucleotide is more frequent than reference nucleotide. OMG I'm dizzy."


Request for clarification.

I successfully ran ./twoBitToFa hg19.2bit hg19.fa and confirmed that hg19.fa was generated. I need both the reference sequence and its index. What is the best way to proceed so that both are available?

Thank you.


Hello. To index the FASTA genome reference with bwa, you should use the bwa index command, for example:

bwa index hg19.fa

It will produce several index files next to the FASTA (hg19.fa.amb, .ann, .bwt, .pac, and .sa). You will not have to reference any of them directly, provided they are kept in the same directory as your FASTA reference file: bwa locates them from the FASTA path.
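If you want to confirm that indexing finished before you start aligning, a check along these lines may help. It is a sketch, not part of bwa: the function name is invented for this example, and it assumes the classic bwa index output of five files with the extensions amb, ann, bwt, pac, and sa.

```shell
#!/bin/sh
# Sketch only: check that the index files written by `bwa index`
# sit next to the FASTA reference.
# bwa_index_complete is a hypothetical helper name, not a bwa command.

# Succeed only if all five expected index files exist and are non-empty.
bwa_index_complete() {
    for ext in amb ann bwt pac sa; do
        [ -s "$1.$ext" ] || return 1
    done
}
```

One way to use it is to build the index only when it is missing or incomplete: bwa_index_complete hg19.fa || bwa index hg19.fa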

Then, I would use bwa mem for the alignment if your reads are >70 bp in length. For shorter reads, you should use one of the earlier bwa algorithms (like we used to do...) or something like bowtie, which is better tailored to shorter reads. For example, with bwa mem:

bwa mem ReferenceGenomes/hg19/hg19.fasta M1.fastq M2.fastq > Aligned.sam
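The read-length rule of thumb can be sketched as a quick check on a FASTQ file. Several things here are assumptions: the function names are made up, only the first read is inspected (trimmed data may vary in length), the file must be uncompressed, and "aln" is my reading of "earlier bwa algorithms" (bwa aln is the older backtrack algorithm).

```shell
#!/bin/sh
# Sketch only: suggest a bwa algorithm from the length of the first read.
# Function names are hypothetical; the 70 bp cut-off is the rule of thumb
# from the text above.

# Print the length of the first read in an uncompressed FASTQ file
# (the sequence is the second line of the first record).
first_read_length() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# Print "mem" for reads longer than 70 bp, otherwise "aln".
# Fails without printing anything if no read could be measured.
pick_bwa_algorithm() {
    len=$(first_read_length "$1")
    [ -n "$len" ] || return 1
    if [ "$len" -gt 70 ]; then
        echo mem
    else
        echo aln
    fi
}
```

For example, pick_bwa_algorithm M1.fastq prints the suggested algorithm for the first of the two FASTQ files used above. For real data it would be safer to look at the whole length distribution from your QC report rather than a single read.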

Prior to alignment, consider running QC on your reads and trimming them, in order to remove junk that would not have aligned anyway or that could lead to false variant calls further down the line due to low-quality bases. For a full example of a pipeline covering trimming, alignment (bwa), generation of QC metrics, and variant calling (mostly using tools from the Wellcome Trust Sanger Institute in the UK rather than the Broad Institute), take a look at my GitHub pipeline: https://github.com/kevinblighe/ClinicalGradeDNAseq (in particular, see AnalysisMasterVersion1.sh for the code).

Kevin
