I've got an RNA-Seq project that got more complicated when it became apparent the reference genome was more different from my strains than anticipated. I'm trying to figure out the best (or least bad) sequence to map my reads against, and could use some help, since I am probably overthinking things at this point.
Things I have:
- Two bacterial strains (A and B). B was generated by random chemical mutagenesis of A (number & type of mutations are unknown.) They turn out to be different enough from the type strain C that they may constitute a distinct substrain.
- Timecourse RNA-Seq data for both A & B (single, not paired end). Multiple timepoints per biological sample, several biological replicates.
- Closed, annotated reference genome for the type strain C.
- Incomplete draft genomes for both A & B (Illumina sequencing). However, they're split across too many contigs. There are some low-complexity regions, near-identical gene copies, and other features that complicate assembly, along with some reads from a difficult-to-eradicate contaminant strain. Multiple assembly methods still leave them a bit of a mess.
- Simple variant calling for A & B against C using the reads obtained for the A & B draft genomes.
- It will be hard to convince other people on this project to spend the money to close or significantly improve the A & B drafts via PacBio or something unless it really is necessary. We didn't originally budget for genome sequencing.
Data I want to be able to assemble:
- Differential gene expression within one substrain (A-only or B-only) across timepoints or between A & B at one timepoint (nothing fancy - pairwise DESeq2 comparisons and I may also try some timecourse-focused tools).
- There are genes of particular interest that are directly affected by assembly issues in the current A & B draft genomes. I nevertheless want to make sure that my differential gene expression analyses for them are as decent as possible.
- I want to be reasonably sure about any differences in gene sequences for A & B vs. C, for interpreting paired proteomics data.
Concerns I have about about mapping my A/B RNA-Seq reads against specific sequences:
- If I align the A & B RNA-Seq reads against their respective draft genomes, it seems to me that there may be some gene-specific problems due to genome assembly issues. For example, I know at least one gene of interest is split across two contigs, and there are (known) duplicated genes that may be mis-counted because they show up as 1 tiny contig in the draft.
- If I align the A & B RNA-Seq reads against C, that avoids the draft genome issues seen for A/B, but obviously there will be more mismatches, which may affect analysis of differential expression in a non-uniform way, and which won't catch any areas where the differences may be too large to map reads to C at all.
- If I align the A & B RNA-Seq reads against pseudo-reference sequences, generated by using the A & B variant calling data to modify the C reference genome, I may also run into other issues related to variant calling artifacts.
What's the least-bad way to try to analyze this dataset?
I think the right answer here is to spend the money, by hook or by crook, and just do the genomes for A and B. Hybrid illumina and nanopore sequencing at one of the facilities I know of is only about £120 a genome - surely there's some slack in the budget or slush money floating around to cover that?
Unfortunately, as an external user, pricing for Nanopore or PacBio at facilities I know of is closer to $1k/genome (including library prep) and we'd of course be doing two genomes. I'll take another look and see if I can find any facilities with a price closer to the range you mentioned, since that would certainly be an easier sell.
Perhaps I could also try to split the difference by completing A, taking a look at the existing reads for B against the new A assembly, and seeing whether the mutations are minor enough that I might be able to get away with preparing a pseudo-reference for B based on variant calling?
I was a little off in my estimate for dual illumina and nanopore, but the service I was thinking of was https://microbesng.uk/our-services/
You might only need illumina if you don’t bother to close the genome. In any case, you should be able to do both genomes for well under a thousand.