Hello everyone!
I'm looking for a way to assess the quality of my long-read PacBio assembly. I already tried Quast, which gives all the classical metrics for genome assembly assessment. I also tried Busco2, which gives a preview of the duplication level within a particular set of genes (for example: arthropoda, insecta...), but I need something more specific.
In my long-read assembly, as the studied species is highly polymorphic, I have already spotted local duplications within particular genes. For example, a gene with the exon structure 1 2 3 4 in the closest species, D. melanogaster, is duplicated in the assembly with a false exon structure: 1 3 4 2 3 4.
I've heard about the amosvalidate pipeline, which identifies "suspicious" regions, but I'm not sure that this pipeline is well suited to my data.
Do you have any idea what tool or method I could use to detect this particular kind of event?
Thanks for your advice!
Cheers,
Roxane
I don't understand very well what you're looking for, but if you want to find polymorphic regions you can start by aligning your reads to the assembly (Bowtie2) and searching for low-coverage regions within the contigs (IGV viewer). You can also predict genes on your assembly and load them in IGV (as a GTF file) for more visual inspection, and/or count the reads mapped on each gene with HTSeq.
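A minimal sketch of the low-coverage search, in case a table is easier to work with than scrolling in IGV. It assumes you already have a three-column per-base depth table (contig, position, depth), for example from `samtools depth -a` on the sorted BAM. The function name, window size and threshold are placeholders of mine, not part of any tool.

```python
from collections import defaultdict


def low_coverage_windows(depth_lines, window=1000, min_mean_depth=10.0):
    """Flag fixed-size windows whose mean read depth is below a threshold.

    depth_lines: iterable of 'contig<TAB>position<TAB>depth' strings with
    1-based positions (assumed to include zero-depth positions).
    Returns a list of (contig, window_start, window_end, mean_depth);
    window_end is the nominal end, even if the contig stops earlier.
    """
    depth_sum = defaultdict(int)   # (contig, window index) -> summed depth
    n_bases = defaultdict(int)     # (contig, window index) -> positions seen
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        contig, pos, depth = line.split("\t")
        key = (contig, (int(pos) - 1) // window)
        depth_sum[key] += int(depth)
        n_bases[key] += 1

    flagged = []
    for contig, idx in sorted(depth_sum):
        mean = depth_sum[(contig, idx)] / n_bases[(contig, idx)]
        if mean < min_mean_depth:
            start = idx * window + 1
            flagged.append((contig, start, start + window - 1, mean))
    return flagged


if __name__ == "__main__":
    # Tiny made-up table: the second 2-bp window of ctg1 has low depth.
    demo = ["ctg1\t1\t20", "ctg1\t2\t20", "ctg1\t3\t2", "ctg1\t4\t4"]
    for hit in low_coverage_windows(demo, window=2, min_mean_depth=10.0):
        print(hit)
```

On a real assembly you would pass an open file handle instead of the demo list, and pick the threshold relative to the median depth of the data set.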
Hi Calejas,
As I'm doing a long-read assembly from PacBio data, there are no low-coverage regions (or at least fewer than with Illumina sequencing). My problem is that I want to detect, within my assembly, small-scale rearrangements that are chimeric. The specific example I gave was verified: in the assembly, the exon structure of the gene was false, with duplicated regions inserted close to each other. It's this kind of error within my assembly that I want to detect.
I thought that amosvalidate would detect this kind of issue, but I'm not sure it's appropriate for PacBio long reads, as it was designed for Illumina reads at first.
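The kind of error described above can be written down as a simple check on exon order. This is only a minimal sketch: it assumes the exon hits have already been located on a contig by some other means (for example by aligning the D. melanogaster exons to the assembly), that all hits are on the same strand, and the function name and tuple layout are hypothetical.

```python
from collections import Counter


def check_exon_order(hits, expected_order):
    """Compare the exon order found on a contig with the expected order.

    hits: list of (exon_number, start_position) for one gene on one contig.
    expected_order: exon numbers in their reference order, e.g. [1, 2, 3, 4].
    """
    # Read the exons in the order they appear along the contig.
    observed = [exon for exon, _ in sorted(hits, key=lambda h: h[1])]
    counts = Counter(observed)
    return {
        "observed": observed,
        "duplicated": sorted(e for e, n in counts.items() if n > 1),
        "missing": [e for e in expected_order if e not in counts],
        "colinear": observed == list(expected_order),
    }


if __name__ == "__main__":
    # The false structure from the example: 1 3 4 2 3 4 (positions made up).
    hits = [(1, 100), (3, 400), (4, 700), (2, 1000), (3, 1300), (4, 1600)]
    print(check_exon_order(hits, [1, 2, 3, 4]))
```

Run over every gene with a known structure in the close reference, this would list the genes whose exons are duplicated or out of order in the assembly, which are the candidates worth inspecting by eye.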
Oh, I'm sorry, that's true. In my opinion (I'm not an expert) you have to define what the origin of the "complexity" that causes chimeric assemblies is: %GC enrichment? Overrepresented k-mers? And also, what have you sequenced? An entire genome? A metagenome? Or amplicons? If you have sequenced amplicons I think the analysis can be easier. Have you tried find_motif from biopieces (on the assembly), or FastQC (on the reads; it includes some interesting graphs)?
Yeah, it's true that it would be useful if we knew why these regions in particular were duplicated. I've sequenced the whole genome; my project is a de novo assembly of a highly dimorphic Drosophila: D. suzukii.
Maybe I can indeed try to analyze with FastQC some particular pieces of my assembly that I think are weird; maybe we can extract information from them... I hope so!
Oh! What about comparing your assembly to a very close reference genome? Use nucmer, I think it will be useful.
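A rough sketch of how such a comparison could be post-processed to look for duplications: it reports reference intervals that are covered by two different alignments to the assembly. It assumes you have exported the alignment coordinates into tuples yourself (for example from a MUMmer coordinate table). The tuple layout, function name and overlap threshold are illustrative choices of mine, not nucmer's own output format.

```python
def duplicated_reference_regions(alignments, min_overlap=200):
    """Find reference intervals hit by more than one assembly region.

    alignments: list of (ref_name, ref_start, ref_end,
                         qry_name, qry_start, qry_end),
    with ref_start <= ref_end (1-based, inclusive).
    Returns (ref_name, overlap_start, overlap_end, query_hit_a, query_hit_b).
    """
    by_ref = {}
    for aln in alignments:
        by_ref.setdefault(aln[0], []).append(aln)

    pairs = []
    for ref, alns in by_ref.items():
        alns.sort(key=lambda a: a[1])  # sort by reference start
        for i, a in enumerate(alns):
            for b in alns[i + 1:]:
                if b[1] > a[2]:
                    break  # later alignments start even further right
                overlap_end = min(a[2], b[2])
                if overlap_end - b[1] + 1 >= min_overlap:
                    pairs.append((ref, b[1], overlap_end, a[3:], b[3:]))
    return pairs


if __name__ == "__main__":
    # Made-up coordinates: two regions of ctg1 align to the same part of 2L.
    demo = [
        ("2L", 1000, 2000, "ctg1", 500, 1500),
        ("2L", 1500, 2500, "ctg1", 3000, 4000),
        ("2L", 5000, 6000, "ctg2", 1, 1001),
    ]
    for hit in duplicated_reference_regions(demo):
        print(hit)
```

A hit does not prove a misassembly, since true duplications in the assembled species would look the same, so each flagged region still needs checking against the reads.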
Maybe I can try that too! My main problem is that the reference I would use, Drosophila melanogaster, has a smaller genome than suzukii...
I'll see where it goes! Thanks for the advice.