Hello, I am trying to sequence pichia genome with a transgene insertion. I have paired end reads from Illumina miniseq.
I tried using BWA with wild type genome as reference followed by samtools mpileup and vcfutils to get a consensus sequence. However, in that approach, my insertion vector was not a part of the final assembly. I aligned the reads against the my vector sequence and saw over 40,000 reads aligning to the vector sequence so I am sure that my insertion vector is present in the genome.
I then used Abyss to do a denovo assembly and used the scaffolds generated from abyss to align against the wild type genome and again I ended up with a final assembly with no vector sequence.
Is there a way to get a final assembly with four chromosomes (Pichia pastoris) and my vector sequences present?
Thanks, Shash
Seems odd to me. Given the coverage you said your transgene had when you mapped using it as reference, it should have been assembled. Did you try blasting the transgene against the abyss assembly?
Yes I filtered the reads that aligned to my vector sequence and then I aligned those reads against the contigs generated by Abyss and they all align. My issue is that when I take those contig sequences and align them against wild type genome as reference using BWA, my vector disappears in the final assembly
Of course the vector "disappears", the reference genome doesn't contain it and my guess is it is soft clipped.
You are not explaining in depth what you are doing, nor are you providing the commands used - details matter a lot here.
Okay here is what I am trying to do. I am trying to get a fully annotated genome of my strain and since Pichia is already fully sequenced an annotated, I was hoping to leverage that data instead of doing it from scratch. Here is what I have done so far:
#assemble reads with abyss
abyss-pe name=pp1 k=64 in='reads1.fq reads2.fq'
This yields ~200 contigs. Now instead of finding ORFs and annotating them, I figured that I can use BWA to align these contigs to Pichia pastoris genome and them manually annotate my vector sequences. And this will also allow me to find locations and copy numbers of my gene.
#use bwa to align to GS115 strain (fully sequenced from NCBI)
bwa index reference.fa
bwa mem reference.fa contigs.fa | samtools sort -o output.bam
samtools index output.bam
Then I aligned my reads to my insertion vector using bwa and got the mapped reads using samtools. Then I aligned those mapped reads against the assembled genome using bwa but only very few reads aligned. I am guessing these are the soft clipped reads and therefore, it seems like my vector is not a part of the final assembly.
And my question is that is there a better way to get the final genome sequence contacting the vector?
Thanks
You could map your reads against the available genome (which doesn't contain the transgene), then just assemble the non-mapping reads de novo.