Assembly method validation - am I doing it right?
7.8 years ago
f2369583 ▴ 10

Hi,

I'm in the early stages of preparing my data for publication, and I wanted to get bioinformaticians' opinions on the way I've handled my data, to make sure I'm not doing anything wrong and to avoid any pitfalls.

Data generated on MiSeq @ 250bp PE, bacterial whole-genomes.

FASTQ files are assembled using SPAdes with the --careful flag. I then look at the final assemblies, the size of the genome, and the N50.
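For anyone checking the same numbers by hand, here is a minimal sketch of how assembly size and N50 can be computed from a contigs FASTA. The function names are mine, not part of SPAdes, and it assumes a plain uncompressed FASTA file.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running
    total (longest contig first) reaches half of the assembly size."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def contig_lengths(fasta_path):
    """Read contig lengths from an uncompressed FASTA file."""
    lengths = []
    current = 0
    seen_header = False
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if seen_header:
                    lengths.append(current)
                seen_header = True
                current = 0
            elif line:
                current += len(line)
    if seen_header:
        lengths.append(current)
    return lengths
```

Usage would be something like `lengths = contig_lengths("contigs.fasta")`, then `sum(lengths)` for the assembly size and `n50(lengths)` for the N50. The file name is only an example.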

For assemblies that are larger/smaller than expected, I generally use sickle to trim short/low-quality reads and then reassemble.

Now a few things I'm not clear on:

Do I need to trim adaptors before assembling with spades? The MiSeq trims adapter seqs as part of bcl2fastq, but should I be doing it as a failsafe?

Do I need to order my contigs after assembly? I'm making phylogenies and trying to work out whether any two isolates are the same based on no. of SNPs.

Some of my fastqs have contaminant reads (BLAST identifies a non-target organism with high identity). Should I just discard these fastqs, or are they salvageable?

Assembly spades bacteria genome • 2.5k views
7.8 years ago
st.ph.n ★ 2.7k

As @WouterDeCoster said you'll need to trim adaptors. You can use scythe following sickle to trim out adaptors. You can also look into trimmomatic to do both.

Other things you can do:

  1. Ordering your reads may speed up assembly, but it is not required. A script from pRESTO will synchronize and pair your reads, and throw out unpaired reads. There is an option to keep the unpaired reads, which you can supply to SPAdes as single reads. If you use sickle/scythe as mentioned above, you will also get a singles.fa file. Concatenate both these fasta files, and use them as the single reads in your SPAdes command.

  2. After quality and adaptor trimming, you may want to see if your reads can be extended with FLASH, which can increase the quality of your assembly. You'll have extended reads (merged paired end), and not_combined_1/2 from here. You can cat the extended reads to your other singles fa.

  3. I would filter post-assembly, by BLAST, using a stringent e-value and a high % identity cutoff. You can use -outfmt 6 with -max_target_seqs 1, and then pull the contigs with strong hits to the database out of your assembled fasta.

  4. Are you more interested in the contigs fasta, or scaffolds fasta?

  5. Your downstream analysis for getting SNPs will depend on mapping your reads (use the QC'd reads here, not the raw ones) back to your assembly. Then you can use the pipeline from the GATK best practices to get a VCF file for comparing SNPs. See the 1000 Genomes Project for more info on VCF files.
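To make the BLAST filtering in point 3 a bit more concrete, here is a rough sketch that reads tabular BLAST output (-outfmt 6 with its default 12 columns) and collects the contig IDs that pass an identity and e-value cutoff. The function name and the default thresholds are placeholders I made up, so tune them to your data. Whether the contigs it returns are the ones to keep or the ones to throw out depends on whether you BLASTed against your target organism or against a contaminant database.

```python
import csv


def contigs_with_strong_hits(blast_lines, min_pident=95.0, max_evalue=1e-20):
    """Return the set of query (contig) IDs with at least one hit passing
    both cutoffs. Expects default -outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    The default thresholds are illustrative, not recommendations."""
    hits = set()
    for row in csv.reader(blast_lines, delimiter="\t"):
        # Skip blank, commented, or truncated lines.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        pident = float(row[2])
        evalue = float(row[10])
        if pident >= min_pident and evalue <= max_evalue:
            hits.add(row[0])
    return hits
```

You would call it on an open file handle, for example `with open("hits.tsv") as fh: ids = contigs_with_strong_hits(fh)` (the file name is just an example), and then use the returned IDs to split your assembled fasta.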

You should maybe change your question to read: "Assembly raw read QC". Assembly "validation" would involve more downstream analysis, post-assembly, by looking at gene ontologies, and homology to your closest reference.

7.8 years ago

Having adapters in your data will be a disaster if you start assembling. So you either check they aren't there (anymore) or you remove them to be sure.
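If you just want a quick check that adapters really are gone, something like the sketch below counts the fraction of reads that still contain an exact copy of an adapter sequence. The function name is mine, and the adapter string you pass in is an assumption about your library prep: AGATCGGAAGAGC is the common start of the Illumina TruSeq adapters and CTGTCTCTTATACACATCT is the Nextera one, but check which kit was used. It only finds exact, full-length matches, so it will underestimate partial adapters at read ends. Proper trimmers handle those.

```python
def adapter_fraction(fastq_lines, adapter):
    """Return the fraction of reads containing `adapter` as an exact substring.
    Assumes standard 4-line FASTQ records (header, sequence, '+', qualities)."""
    adapter = adapter.upper()
    total = 0
    with_adapter = 0
    for index, line in enumerate(fastq_lines):
        # The sequence is the second line of every 4-line record.
        if index % 4 == 1:
            total += 1
            if adapter in line.strip().upper():
                with_adapter += 1
    return with_adapter / total if total else 0.0
```

For example, `with open("reads_R1.fastq") as fh: print(adapter_fraction(fh, "AGATCGGAAGAGC"))` (the file name is only an example). A value near zero after trimming is what you hope to see.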
