Hi everyone,
I'm a new user and have been trying to reanalyze a zebrafish RNA seq data from another study which are available in fastq files (SRA tools were used). However, I've been stuck on the use of tophat for a while and I cannot move forward. I'm running tophat through:
module load bowtie/0.12.8
module load samtools/0.1.18
module load tophat/2.0.3
tophat --no-coverage-search -G /data/genes.gtf -o /data/tophat_output/test4 /data/Genome/BowtieIndex/genome /data/fastq/nSC1.fastq > /data/tophat_output/test4/test4-nSC1.log
I used those versions of software as I don't know which versions of samtools tophat and bowtie work well together. The problem is I don't get any accepted hits (ie, accepted.hits.bam is empty) and all the reads goes to unmapped.bam file.
In the tophat logfile i get:
[2018-09-14 12:52:16] Checking for Bowtie
Bowtie 2 not found, checking for older version..
Bowtie version: 0.12.8.0
[2018-09-14 12:52:16] Checking for Samtools
Samtools version: 0.1.18.0
[2018-09-14 12:52:16] Checking for Bowtie index files
[2018-09-14 12:52:16] Checking for reference FASTA file
[2018-09-14 12:52:16] Generating SAM header for /lustre/beagle2/navbav/hearing-transcriptomics/zebrafish_2018_Batra/Genome/BowtieIndex/genome
format: fastq
quality scale: phred33 (default)
[2018-09-14 12:52:21] Reading known junctions from GTF file
[2018-09-14 12:52:24] Preparing reads
left reads: min. length=35, max. length=101, 54560412 kept reads (284568 discarded)
[2018-09-14 13:24:16] Creating transcriptome data files..
[2018-09-14 13:25:49] Building Bowtie index from genes.fa
[FAILED]
Error: Couldn't build bowtie index with err = -11
I searched about this error and seems to be a compatibility issue between the names of the genes in the GTF file and the Bowtie index files. But given I used iGENOME for my reference genome and GTF (annotation) file, am I wrong to think that they should be matched already ? I used the UCSD format btw.
Nevertheless, I checked the gene names in the GTF file and the reference genome file were matched to the best that I could check, which are as following:
For Ref file:
> bowtie-inspect --names genome
chr10
chr11
chr12
chr13
chr14
chr15
chr16
chr17
chr18
chr19
chr1
.
.
for GTF file
> head -5 genes.gtf
chr1 unknown exon 6642 6760 . - . gene_id "rpl24"; gene_name "rpl24"; p_id "P5276"; transcript_id "NM_173235"; tss_id "TSS1927";
chr1 unknown stop_codon 6680 6682 . - . gene_id "rpl24"; gene_name "rpl24"; p_id "P5276"; transcript_id "NM_173235"; tss_id "TSS1927";
chr1 unknown CDS 6683 6760 . - 0 gene_id "rpl24"; gene_name "rpl24"; p_id "P5276"; transcript_id "NM_173235"; tss_id "TSS1927";
chr1 unknown CDS 6892 6955 . - 1 gene_id "rpl24"; gene_name "rpl24"; p_id "P5276"; transcript_id "NM_173235"; tss_id "TSS1927";
chr1 unknown exon 6892 6955 . - . gene_id "rpl24"; gene_name "rpl24"; p_id "P5276"; transcript_id "NM_173235"; tss_id "TSS1927";
I was wondering if someone have any advice for me?
Much appreciated in advance. Navid
Even though you are trying to reanalyze old data there is no reason to use TopHat for this. Use STAR, HISAT2, BBMap instead.
That said you need to pay attention to this part from TopHat Manual when you want to make transcriptome specific index. This is a special one time run that you do before you align the actual data.
Thanks a lot for the guidelines, I'll go through all the comments, as suggested.
Tophat has been retired in favour of HISAT, and the versions you are using for all programs are really old. Unless you want to try and replicate exactly what has been done in a paper, use HISAT or STAR (any one of them replaces Bowtie+Tophat, with greater accuracy and speed) for mapping, and SAMtools 1.X - current version is 1.9 - to manipulate SAM files.
edit: trying to replicate published analyses never completely succeeds, so don't be surprised when, down the line, your results differ from the published ones.
Thank you, I'll consider using other packages as well provided I can find step by step tutorials for them.
Cheers, N