Question

Looking for structural variants with paired-end sequencing data

0

Entering edit mode

7.7 years ago

ThePresident ▴ 180

Hello,

I would like to identify genomic structural variants, specifically small genomic inversions in bacterial genome of a relatively small size (2 Mb), by using paired-end Illumina sequencing data.

This paper did something interesting but is impossible to replicate because they use custom MATLAB scripts and they're not available to the public... But the principle is interesting. Basically, they claim that plotting the gap size (the calculated genomic distance between the pairs of reads) against their genomic location will produce a distinct pattern. Normally, the vast majority of reads will have a similar gap size (producing what they call a ribbon), but if there is a genomic rearrangement, there will be a deviation from that pattern (what they call a funnel). See image below.

My question is: how to extract from a SAM file the gap size and the genomic position of the reads so I can try and plot them?

Thank you in advance,

TP

A figure from Goldberg et al.

Illumina paired-end • 1.8k views

ADD COMMENT • link updated 7.7 years ago by zwdzwd ▴ 120 • written 7.7 years ago by ThePresident ▴ 180

0

Entering edit mode

I would suggest to just email the authors of the paper and ask for the script.

ADD REPLY • link 7.7 years ago by WouterDeCoster 47k

0

Entering edit mode

Haven't heard from them yet. However, I think this would be pretty easy to do for someone who knows how to code (I'm thinking python?). If I have time I might try it, but I'm no bioinformatician, so...

ADD REPLY • link 7.7 years ago by ThePresident ▴ 180

score 0 · Answer 1 · 2016-07-28

0

Entering edit mode

7.7 years ago

zwdzwd ▴ 120

You probably need the TLEN field for insert size (gap size plus read length) and RNAME/POS for genomic position of reads. see https://samtools.github.io/hts-specs/SAMv1.pdf

Btw, the idea of using abnormal insert size (discordant read pairs) to infer structural variation has been around since the dawn of paired end NGS

see these reviews

http://www.nature.com/nrg/journal/v12/n5/fig_tab/nrg2958_F2.html

http://www.ncbi.nlm.nih.gov/pubmed/24405614

http://www.sciencedirect.com/science/article/pii/S2210776213001579

and following tools are notably popular (most of them cover inversion)

DELLY http://www.ncbi.nlm.nih.gov/pubmed/22962449

BreakDancer http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3661775/

LUMPY http://www.ncbi.nlm.nih.gov/pubmed/24970577

GASVPro https://genomebiology.biomedcentral.com/articles/10.1186/gb-2012-13-3-r22

Meerkat http://compbio.med.harvard.edu/Meerkat/

ADD COMMENT • link 7.7 years ago by zwdzwd ▴ 120

0

Entering edit mode

Thanks, I'll play around with my SAM files and try to extract TLEN and RNAME/POS fields. Also, all these tools that you mentioned are designed and tested on human genomes / cancer cell lines. I haven't try them, but I suspect they might perform less well with bacterial genomes (although, being smaller and less complex, analysis of bacterial genomes might be more simple).

ADD REPLY • link 7.7 years ago by ThePresident ▴ 180

0

Entering edit mode

You are right. Most of these tools are best tested on human genome. It's hard to say whether SV on smaller genome is easier to tackle, since most bacterial genomes are circular (and small), boundary cases such as discordant read pairs striding the start/end position might occur more often.

ADD REPLY • link 7.7 years ago by zwdzwd ▴ 120

0

Entering edit mode

EDIT: Actually, when I look into my SAM files, there is a 0 value for the TLEN. It seems to be an optional field but I just can't figure out how to add it during bowtie alignment. Not sure I understand bowtie manual...

EDIT2: I was looking at the wrong file (with single-end alignment), my bad.

ADD REPLY • link 7.7 years ago by ThePresident ▴ 180