Question

Assemble bacterial .fastq files and find differences (SPAdes followed by MUMmer)

0

Entering edit mode

4.6 years ago

LRStar ▴ 200

This is somewhat of a follow-up from a previous post (https://www.biostars.org/p/395813/). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria. My assignment is to 1) assemble each strain into its full circle and 2) perform a pairwise comparison between the four strains to discover insertions, deletions, and duplications between them.

Based on suggestions from the previous post, I tentatively have the following pipeline in mind: 1) Run fastqc, 2) Create contigs using SPAdes, 3) Fully align contigs into their "full circle or full linear alignment" using MUMmer and examine for insertions, deletions, and duplications between the files.

The fastq output looks good. So, I ran spades.py -k 21,33,55,77 --careful -1 read1.fastq -2 read2.fastq on each sample. I then used R to examine the contigs.fasta files in the main output directory (I think this means the output from using a kmer of 77). Now, I am really lost trying to determine if the output looks decent (i.e. ready for input to MUMmer for comparison between the sequences). Here were my findings:

1) Across the four samples, the number of contigs ranged between 40 and 100.
2) The 5-number-summary of the contig lengths were as follows:

##     Sample Min    Q1   Med      Q3    Max
## 1 Sample 1  78 155.0 520.5 14542.5 199284
## 2 Sample 2  78 237.5 277.0  8684.5 498771
## 3 Sample 3  78 247.0 363.0 34922.0 364001
## 4 Sample 4  78 240.5 268.0  1889.0 364018

3) The 5-number summary of the contig coverage were as follows:

##     Sample      Min        Q1      Med       Q3  Max
## 1 Sample 1 0.253165 15.477241 19.47476 33.14796  847
## 2 Sample 2 0.415254  0.805012 20.91779 35.29753  771
## 3 Sample 3 0.770833  0.907975 28.25000 33.77679 1120
## 4 Sample 4 0.544141  0.782492  0.96732 46.50450 1262

My first question is: Do any of these values seem worrisome? (i.e. the large range in the number of contigs? or, certain contig coverage being less than 1? or, discrepancies in the contig coverage between samples, especially Sample 4 having a Q3 for contig coverage less than 1? etc.?)

My second question is: related to the MUMmer process (where I hope to identify differences between the sequences). I have read the MUMmer vignette a few times but am still unclear about:

A) Whether a contig.fasta file is appropriate as input (as each contig.fasta file contains 40-100 contigs in my case, rather than an "fully assembled genome" - assuming there is a difference)?

B) Whether it may be better simply to do the entire assembly process in MUMmer? Rather than align the contigs in SPAdes and then import into MUMmer for an additional alignment?

C) How I can determine how long my sequences are? For instance, MUMmer suggests only using run-mummer1 on small sequences (which they define as <10Mbp). Right now, all I have are contigs. Is it possible for me to know the sequence lengths for the 4 samples without them being fully aligned (assuming a set of contigs is not "full alignment")?

D) How I can determine how similar the sequences are? This can help me decide whether to use NUCmer, PROmer, run-mummer1, or run-mummer3. For now, as a default, I plan to simply run all.

Suggestions to any of these topics would be greatly appreciated. I am having difficult moving forward possibly due to preconceived misunderstandings about the difference between a set of contigs and a fully-aligned genome, a lack of me finding resources demonstrating the transition between SPAdes to MUMmer, and a lack of my understanding of determining the degree of similarity between these contig files.

contig bacteria spades mummer SNP • 2.4k views

ADD COMMENT • link updated 4.6 years ago by Joe 21k • written 4.6 years ago by LRStar ▴ 200

score 3 · Accepted Answer · 2019-09-04

1)

Higher numbers of contigs from short read sequencing is to be expected. SPAdes automatically optimised the kmer size to give you what it considers the best assemblies. 40-100 contigs doesn't sound outlandish to me. I'm guessing if this is assignment data, they've given you reasonably solid FASTQ data.

If your coverage numbers are based on what SPAdes puts in the fasta headers, that isn't sequence coverage, it's kmer coverage, so a low number is not necessarily worrisome.

My advice would be to ditch R and use some proper sequencing QC programs like Qualimap and Quast. You will need other indicative numbers such as the N50 and average length etc, to say much at all.

2)

A) A contigs.fasta should be an appropriate input for an aligner like mummer, however the alignments might be trash without first sorting the contigs by order. For this, I would suggest finding a reference closed genome (if you know what the organism for the fastqs is), and using progessiveMauve to reorder the contigs. You can then concatenate them and feed this in to mummer. Alternatively (and possibly better), is to use mummer to align each of your query genomes against the reference, and call variants that way (which could be considerably more accurate and easy to interpret, assuming you can indeed get such a reference). Ideally, you'd have perfectly complete and closed data, but thats not an option with short reads alone.

From the mummer webpage:

MUMmer can also align incomplete genomes; it can easily handle the 100s or 1000s of contigs from a shotgun sequencing project, and will align them to another set of contigs or a genome using the NUCmer program included with the system.

B) Unless I'm forgetting something about mummer you couldn't do the whole process in mummer even if you wanted to. SPAdes is an assembler, not an aligner. Running mummer on short reads would achieve nothing really.

C) Quast should give you sequence lengths etc, but you can calculate this on a contig by contig basis and sum the results easily in the shell. For a bacteria, its highly unlikely you'll have >10Mb, so I wouldn't worry about that. The number you get wont be perfect, as some of the contigs in your file will likely be junk, but it should get you a number to within a few hundred thousand base pairs.

D) You need to clarify what you mean by "different strains of bacteria" to answer this. Are these literally strains of the same species? If so, they are likely close enough for nucmer to give good results. If they are different species, you may need to experiment; start with nucmer and work upwards. I'm unconvinced about alignment from 6-frame translations as I don't know how good mummer's gene predictor is and this could give a lot of false positives.

Finally, you will probably want to consider re-mapping your reads to your assemblies with bwa/bowtie or similar, then calling variants with VarScan or BreSeq (if they're sufficiently close you could do this with a reference sequence, and avoid the need for assembly etc at all). This will check all of the reads for each assembly and see if there are variations versus the called base in the final assembly.