Hi all,
I'm a microbiology postdoc but new to bioinformatics, so could do with a little help. I recently received the output files (fastq) from an RNAseq experiment we ran. The RNA was library prep'd and sequenced by our in house genomics facility, but I wanted to learn how to do the analysis of it. The RNA was from Gram +ve bacteria, and I want to do differential expression. Using FastQC the quality of the data is great, so I'm not concerned about that.
I've been advised to use bowtie2 to align to the reference, so I started with bowtie2-build to create the indexes from the reference .fna file. That seemed fine, and created several files.
Then I used /Volumes/Daniel_HDD/RNAseq/bin/bowtie2-2.4.1/bowtie2 -x /Volumes/Daniel_HDD/RNAseq/index/700669_index -1 /Volumes/Daniel_HDD/RNAseq/WT1_R1_fastq.gz -2 /Volumes/Daniel_HDD/RNAseq/WT1_R2.fastq.gz -S /Volumes/Daniel_HDD/RNAseq/WT1.sam
Now, I previously did a run through using a nature protocols pipeline (Hisat2, then stringtie), and it gave me BAM files, but the bowtie2 method suggests to output to SAM files (which I know are bigger), but even the sam files for the last attempt were not anywhere near this size.. They've been running for a 1.5h and its currently at 19gb for alignment of this first pair. (I'm running a MacBook pro, 2015, 2.2 GHz Quad-Core Intel Core i7, 16gb RAM)
Is this normal? Should the file be this big? The genome of the reference is only 2.2mbp, and there are about 2k genes.. so I wasn't expecting it to be this big a file. Should I have output to a bam file, or compressed, or would that matter for this stage? Is there another pipeline that would be better to follow? Thanks in advance
Size of files is never a good metric. How many reads are you working with? If you have a large dataset then you are likely to get a large SAM file (which is text and uncompressed as compared to your reads which are compressed). You may have secondary alignments (multi-mapping) which would lead to additional lines in SAM file.
I don't recollect if
bowtie2
will output to standard out if you remove the-S
option. If it does then you can directly pipe the output ofbowtie2
intosamtools view | samtools sort
to generate a sorted BAM file directly.Do you have a reference transcriptome?
I downloaded from the NCBI the genome assemblies folder for this strain, and in that are many different files - so one of those could be the transcriptome rather than the genome. They are (among other text files, all preceded by GCF_000026665.1_ASM266v1_) cds_from_genomic.fna.gz, genomic.fna.gz, genomic.gbff.gz, genomic.gff.gz, genomic.gtf.gz, protein.faa.gz ,protein.gpff.gz, rna_from_genomic.fna.gz < could this one be it?, translated_cds.faa.gz
Should I use that to create the index files instead of the genomic one?
Since you are working with bacteria it can be the CDS file. But since you have already built an index from genome reference and there is GTF/GFF file available you can just use one of those files to do read counting with
featureCounts
.