identifying transgene insertion site in WGS
16 months ago
Assa Yeroslaviz ♦ 1.2k

I would like to ask for your opinions. I have a mouse WGS data set in which a transgene was inserted at an unidentified location, causing a specific, unexpected phenotype.

We would like to identify the insertion position(s).

I was thinking about trying a de novo assembly (SOAPdenovo), but I'm not sure if this is the correct approach. With a de novo assembly I was hoping to identify the contigs containing the insertion site (it is ~6.2mb in size) and work out where it was lodged in the genome (using mouse as the reference organism).

Do you think this can be a good solution?

Can anyone recommend a better approach or tool for this kind of analysis?

13 months ago
d-cameron ♦ 2.0k

I've had success using GRIDSS to do this. I even included an example of doing this in the GRIDSS paper.

In short you:

  • Add transgene to your mm10 reference
  • (optional depending on transgene sequence) mask (replace with Ns) the mouse homolog of your gene of interest
  • Align reads to mm10+transgene
  • Call SVs (using GRIDSS)
  • Look for SVs to/from your transgene (ignoring those that go to your mouse homolog).
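The first two steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the contig name `transgene`, and the file names in the usage note are hypothetical, and the whole genome is held in memory, which is wasteful for mm10. In practice a dedicated tool such as `bedtools maskfasta` is probably the better choice for the masking step.

```python
def read_fasta(path):
    """Read a FASTA file into a dict of {contig_name: sequence}."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Keep only the first word of the header as the contig name.
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def mask_region(seq, start, end):
    """Replace the 1-based, inclusive interval [start, end] with Ns."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("mask interval is outside the sequence")
    return seq[:start - 1] + "N" * (end - start + 1) + seq[end:]


def build_reference(genome, transgene, masks=()):
    """Return genome contigs (optionally masked) followed by the transgene contigs.

    masks is an iterable of (contig, start, end) tuples, 1-based inclusive,
    e.g. the mouse homolog of the gene of interest.
    """
    clash = set(genome) & set(transgene)
    if clash:
        raise ValueError("contig names used twice: %s" % sorted(clash))
    combined = dict(genome)
    for contig, start, end in masks:
        combined[contig] = mask_region(combined[contig], start, end)
    combined.update(transgene)
    return combined


def write_fasta(records, path, width=60):
    """Write {name: sequence} to a FASTA file with fixed-width lines."""
    with open(path, "w") as handle:
        for name, seq in records.items():
            handle.write(">%s\n" % name)
            for i in range(0, len(seq), width):
                handle.write(seq[i:i + width] + "\n")
```

Usage would look something like `write_fasta(build_reference(read_fasta("mm10.fa"), read_fasta("transgene.fa"), masks=[("chr11", 1500, 2500)]), "mm10_plus_transgene.fa")`, where the file names and the masked interval are placeholders. The combined FASTA then needs to be indexed for whichever aligner you use.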

Edit: if you're trying to identify an _unknown_ transgene, then you'll need to do de novo assembly to reconstruct it. I'd still recommend running GRIDSS (v2.2 or later) against mm10, as it will report the insertion site and (~400 bp of) sequence in VCF single breakend notation.
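For the last step of the list (finding SVs to or from the transgene), a rough filter over the VCF might look like the sketch below. It assumes the ALT column follows the breakend notation of the VCF specification (`t[chr:pos[`, `t]chr:pos]`, `]chr:pos]t`, `[chr:pos[t`) and that single breakends are written with a leading or trailing `.`. The function names and the contig name `transgene` are hypothetical, and no quality filtering is done here.

```python
import re

# Breakpoint ALT alleles: optional bases, a bracket, mate "chrom:pos", a bracket, optional bases.
BREAKPOINT_ALT = re.compile(r"^[A-Za-z.]*[\[\]](?P<chrom>.+):(?P<pos>\d+)[\[\]][A-Za-z.]*$")


def classify_alt(alt):
    """Return (kind, mate_chrom, mate_pos) for one ALT allele."""
    match = BREAKPOINT_ALT.match(alt)
    if match:
        return "breakpoint", match.group("chrom"), int(match.group("pos"))
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        # Single breakend: only one side of the junction could be placed.
        return "single_breakend", None, None
    return "other", None, None


def junctions_to_transgene(vcf_lines, transgene, homolog=None):
    """Collect (host_chrom, host_pos, transgene_pos) for host/transgene breakpoints.

    homolog is an optional (chrom, start, end) interval, 1-based inclusive,
    whose calls are ignored (e.g. the mouse homolog of the inserted gene).
    """
    found = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        kind, mate_chrom, mate_pos = classify_alt(alt)
        if kind != "breakpoint":
            continue
        # Keep only calls with exactly one end on the transgene contig.
        if (chrom == transgene) == (mate_chrom == transgene):
            continue
        if chrom == transgene:
            host_chrom, host_pos, tg_pos = mate_chrom, mate_pos, pos
        else:
            host_chrom, host_pos, tg_pos = chrom, pos, mate_pos
        if homolog and host_chrom == homolog[0] and homolog[1] <= host_pos <= homolog[2]:
            continue
        # Each breakpoint is normally reported once per side, so a set removes the duplicate.
        found.add((host_chrom, host_pos, tg_pos))
    return sorted(found)
```

With an uncompressed VCF this could be called as `junctions_to_transgene(open("calls.vcf"), "transgene")`, where `calls.vcf` is a placeholder name. Records that `classify_alt` labels `single_breakend` are skipped by this filter and would need to be inspected separately.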

Hi Cameron,

I have tried GRIDSS before (it was still version 1.5.1 back then) and my experience with it was good, if somewhat mixed. We got some nice results which pointed to one specific insertion site (or possibly two different ones; we couldn't quite work it out from the results). Do you think I should try the new version (v. 2.2.0) again? We did exactly what you listed above (merging the genomes, masking the regions in the mouse chromosomes, alignment, SV calling -> VCF file).

I was thinking the de novo assembly would give me a more straightforward result. Or maybe I could even use your own tool, Socrates, to look for exactly that.

Sorry for the delayed response.

> Do you think I should try the new version (v. 2.2.0) again?

I do. V2.0 added single breakend reporting, which can be quite helpful in this sort of analysis. Whilst my collaborators supply an expected construct when engaging me, I've yet to have a project where the construct I've been given has been correct. One transgene included a PhiX component that they forgot to tell me about; another collaborator sent me the full sequence for the human gene they'd inserted, and I then had to trace through all the exon-to-exon SVs to validate that it was the correct transcript; and so on.

Although single breakend calls have an intrinsically higher FDR than breakpoint calls, they're extremely useful in determining a) whether you're missing bits of your construct, and b) whether you have an insertion site in repetitive sequence.

> I was thinking the de novo assembly would give me a more straightforward result.

You'll still need to do the post-assembly steps of identifying the contigs containing the construct and aligning the contigs back to the reference. If you have multiple insertion sites, this will result in branches in the assembly graph which will split your contigs at the insertion sites thus putting you right back where you started.
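As a rough illustration of the first post-assembly step (finding which contigs contain the construct), a crude k-mer screen could look like the sketch below. It is not a substitute for aligning the contigs back to the reference, and the function names and the default `k=31` are assumptions made for this example.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all k-mers in seq (upper-cased), skipping any that contain N."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1) if "N" not in seq[i:i + k]}


def contigs_with_construct(contigs, construct, k=31, min_shared=1):
    """Return {contig_name: shared_kmer_positions} for contigs sharing k-mers with the construct.

    Both strands of the construct are considered, since a contig may be
    assembled in either orientation.
    """
    construct_kmers = kmers(construct, k) | kmers(revcomp(construct), k)
    hits = {}
    for name, seq in contigs.items():
        seq = seq.upper()
        shared = sum(
            1 for i in range(len(seq) - k + 1) if seq[i:i + k] in construct_kmers
        )
        if shared >= min_shared:
            hits[name] = shared
    return hits
```

Contigs flagged this way would still have to be aligned to the mouse reference to place the flanking sequence. If the construct shares sequence with the mouse genome (for example a mouse promoter or a homologous gene), unrelated contigs will be flagged as well.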

