PacBio sequencing for variant calling?
3
5.3 years ago
borealis ▴ 30

Hello BioStars!

I am new to variant calling and would appreciate your feedback, pointers in the right direction, or literature recommendations. I have a question about the practicality of using PacBio sequencing to call variants between two plants, e.g. one with a normal phenotype and one diseased (allopolyploid, >50% repetitive elements). The disease is likely the result of a disruption (insertion/deletion) in a gene or a transcription factor in a biosynthetic pathway. A reference genome is available, assembled with a hybrid PacBio/Illumina approach.

My initial plan was to do Illumina sequencing (15x coverage) and map to the reference genome. Now there is an opportunity to do PacBio sequencing, and I am debating whether it will help me answer my biological question. My preliminary idea was to use ~10x PacBio coverage to close the gaps left by the paired-end reads and the 15x Illumina data to correct the errors... Thank you for reading.

sequencing • 4.9k views
0

While someone else may come along with a proper answer, these two things are going to make this a tough task:

  • plant, allopolyploid, >50% repetitive elements
  • the coverage you expect to have (are you planning CCS-type reads rather than long reads?)

If you suspect you know the likely cause (gene), then going after it selectively may be better.

0

Thank you! Going selectively for the genes involved in the pathway is good advice.

4
5.3 years ago
Rox ★ 1.4k

Hello Borealis!

I think this paper may help you: https://www.nature.com/articles/s41592-018-0001-7 (the free bioRxiv version is here: https://www.biorxiv.org/content/biorxiv/early/2017/07/28/169557.full.pdf)

It explains how long-read sequencing (such as ONT and PacBio) can help find larger structural variants, and it also presents their NGMLR + Sniffles pipeline for detecting structural variants from long-read data.
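For reference, a minimal sketch of that mapping + SV-calling workflow might look like the commands below (file names are placeholders and the options shown are Sniffles v1 style; check each tool's help for your installed versions):

    # map long reads with NGMLR using its PacBio preset (reference.fasta / pacbio.fastq are placeholders)
    ngmlr -t 8 -x pacbio -r reference.fasta -q pacbio.fastq -o mapped.sam
    # sort and index the alignments
    samtools sort -o mapped.sorted.bam mapped.sam
    samtools index mapped.sorted.bam
    # call structural variants with Sniffles (v1 syntax)
    sniffles -m mapped.sorted.bam -v structural_variants.vcf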

I have already tried this method with PacBio data and it worked pretty well for me (it wasn't plant data, but a highly diploid Drosophila species). However, I had enormous PacBio coverage (100x+...), so I don't know if only 10x would be enough for that analysis!

Bonus: you can then try to visualize your data using Ribbon (http://genomeribbon.com/).

Hope this guides you a little bit! Don't hesitate to ask further questions.

Cheers,

Roxane

0

Hello Roxane, thanks for your answer and the link to the paper! Yes, it seems from the other answers and this paper (https://www.nature.com/articles/nature15714 - "We generated ~72× sequencing coverage of the Oropetium genome using 32 SMRT cells on the PacBio RS II platform") that more coverage would be needed... I also found that the reference genome of my plant was constructed using 100 SMRT cells, so that would be much more coverage than 10x.
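As a rough back-of-the-envelope check, coverage is just total bases sequenced divided by genome size, so the number of SMRT cells scales accordingly (the genome size and per-cell yield below are illustrative assumptions, not numbers from this thread):

    # rough estimate of SMRT cells needed for a target fold coverage
    awk 'BEGIN {
        genome = 2.4e9;    # genome size in bp (assumed placeholder)
        yield  = 0.75e9;   # bases per SMRT cell (assumed placeholder)
        target = 40;       # desired fold coverage
        printf "cells needed for %dx: %.0f\n", target, genome * target / yield
    }'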

3
5.3 years ago
tjduncan ▴ 280

Hi All,

What the others have said is generally accurate. Illumina short-read data is great for SNPs, while long-read data (PacBio or ONT) is typically used (though this is changing) for de novo assemblies or structural variation. If you truly have a good reference genome, your idea of mapping Illumina data from your samples for SNVs and using ~10x PacBio coverage for SV detection may work just fine.

Once you have PacBio data (~10x coverage for SV detection), you can use their pbsv structural variant calling pipeline to call SVs. It is available on bioconda (along with other key PacBio bioinformatics tools), or you can download the entire SMRT Link software suite from their website.
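A rough sketch of that pbsv workflow could look like this (file and sample names are placeholders; check the pbsv documentation for current options):

    # align PacBio subreads to the reference with pbmm2 (ref.fasta / subreads.bam are placeholders)
    pbmm2 align ref.fasta subreads.bam aligned.bam --sort --sample sample1
    # collect SV signatures, then call structural variants
    pbsv discover aligned.bam sample1.svsig.gz
    pbsv call ref.fasta sample1.svsig.gz sample1.sv.vcf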

A really interesting and cutting-edge approach is what rrbutleriii mentioned. With the newest Sequel 6.0 chemistry release (https://www.pacb.com/products-and-services/sequel-system/latest-system-release/) you can generate PacBio CCS (circular consensus sequencing) data. Using this method you size-select fragments of up to ~10kb and then sequence them. Because each unique molecule is sequenced multiple times, you get a high-quality consensus for each molecule. With a 10kb fragment, more than half of your data will be at a single-molecule accuracy of >Q20 (1 in 1000 error). Because a good chunk of the data then has an error profile closer to Illumina data (>Q30, 1:10000 error), many of the existing short-read variant calling tools (e.g. GATK) that were designed for Illumina data will work with PacBio CCS data. You can see this approach in the following poster.
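In outline, that CCS-based approach might look something like the following (file names are placeholders, and exact flag names vary between ccs/SMRT Link versions, so treat this as a sketch rather than a recipe):

    # build circular consensus reads from the subreads (flag names may differ by ccs version)
    ccs subreads.bam sample.ccs.bam --min-passes 3
    # align the consensus reads to the reference using the CCS preset
    pbmm2 align ref.fasta sample.ccs.bam ccs.aligned.bam --sort --preset CCS --sample sample1
    # call small variants with a short-read-style caller, e.g. GATK (requires read groups and a ref dictionary/index)
    gatk HaplotypeCaller -R ref.fasta -I ccs.aligned.bam -O sample.vcf.gz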

Building on top of that, here is a preprint that was released yesterday that further breaks down this approach. It is pretty cool because you get all types of genomic variation (SVs, SNPs, haplotypes if relevant to your sample) and good de novo genomes from only one technology.

At the end of the day, I am sure you just want meaningful answers to your biological question. At this point in time there is no clear-cut way to identify all types of genomic variation (SNVs are relatively easy, while larger structural events are still the wild west), so it is best to jump in with the highest-quality data sets you can afford to generate and see what you find.

Hope some of this info helps!

0

Minor comment: Q20 is 1 in 100 errors; likewise, Q30 is 1 in 1000.
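For anyone following along, the Phred relationship is error probability = 10^(-Q/10):

    # convert Phred scores to "1 error in N bases"
    awk 'BEGIN { for (q = 20; q <= 30; q += 10) printf "Q%d -> 1 error in %d bases\n", q, 10^(q/10) }'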

0

Hi tjduncan,

Thanks for the reply! https://www.biorxiv.org/content/biorxiv/early/2019/01/13/519025 - this paper is really interesting. Yes, I think I also need to consider how much coverage/sequencing I can afford.

2
5.3 years ago
rrbutleriii ▴ 260

tl;dr The question to ask is what type of variation you are expecting to encounter; a non-systematic approach to be sure, but there is no one-size-fits-all solution. In my personal experience (TruSeq/Nextera Illumina with several paired-end types, de novo and reference-based; PacBio RS II p4c2/p5c3/p6c4 de novo; Nanopore R7/R9.4 de novo), I would go with PacBio (>40x coverage, ideally >80x; 15-20kb fragment size) combined with Illumina (10-15x coverage). PacBio alone will work for SVs as long as you are not specifically calling SNPs, since homopolymer frameshifts and read accuracy are potential confounders (not insurmountable, just something to be aware of).

Longer version: as Roxane said, there are numerous advantages to long reads for structural variation (also see this paper). That said, they are noisy reads, so 10x depth will not be enough for majority SNP basecalls, hence the typical hybrid sequencing route. Comparatively, Illumina cannot call the same types of SVs without additional technologies (e.g. combining with mate-pair runs, as in human genomics), and I am personally skeptical of its ability to call things other than large-scale insertions, deletions, and inversions. Tandem insertions of multiple copies of a gene (as in the ultra-long Nanopore read paper above) would fall into this category and could be relevant to an allopolyploid with differential gene copy number in said pathway.

With PacBio, you can choose your fragment size and overcome particular obstacles. There was a recent PR blurb by PacBio for Sequel 6.0, basically saying that their circular consensus (CCS) reads could achieve >99% accuracy at high throughput, the drawback being that this is on 1kb and 2kb fragments, which I imagine is the result of five or more passes over the target sequence in each CCS read. With the RS II, even without CCS reads, the majority of basecalling errors we saw were stochastic, so higher-depth coverage (>80x) was sufficient for consensus even with homopolymers of 3-6 bases, something that typically stymied 454 sequencing. Larger homopolymers will still give you frameshift trouble, though I haven't tried the Sequel yet, so I don't know if they have gotten better at resolving those.

The last time I spoke with a PacBio rep, their suggestion was to run a 2kb library for CCS and a 15-20kb library for scaffolding (taken with a grain of "please buy my product" salt). If you have cheap Illumina available to you, though, it is still the better option for checking individual base accuracy and homopolymers, although Illumina itself is not immune to that issue. In our experience with bacterial genomes, PacBio 15kb long reads at 80x were sufficient to no longer need Illumina for checking. Given eukaryotic genome complexity and the size of typical plant genomes, the hybrid approach is still your best bet. However, you should probably scale up the PacBio long-read coverage.
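If it helps, one common way to use Illumina reads to double-check base-level accuracy of a PacBio consensus is short-read polishing; Pilon is just one example of such a tool (not something discussed above), and the file names below are placeholders:

    # map Illumina paired-end reads to the PacBio assembly/consensus
    bwa index pacbio_consensus.fasta
    bwa mem -t 8 pacbio_consensus.fasta illumina_R1.fastq illumina_R2.fastq | samtools sort -o illumina.sorted.bam -
    samtools index illumina.sorted.bam
    # polish with Pilon and report the changes it makes
    pilon --genome pacbio_consensus.fasta --frags illumina.sorted.bam --output polished --changes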

0

Hello rrbutleriii, thanks for the really informative answer; lots of things to look into. I will read up on the PacBio Sequel, I don't think our lab has used it before. Another thing I was thinking of looking for is transposable elements in the biosynthetic pathway.
