Soft-clipping of reads in Amplicon-sequenced data
6.7 years ago

I have a variant calling pipeline which I use to process amplicon-sequenced fastq files; it uses cutadapt to remove the adapter sequences on either the 5' or 3' end, then performs alignment with bwa mem.

The user guide for cutadapt states:

"And if you use BWA-MEM, the trailing (3') bases of a read that do not match the reference are soft-clipped, which covers those cases in which an adapter does occur."

The BAM files produced by bwa do show examples of soft-clipped trailing bases. I don't expect this to be an issue for the later stages of variant calling, since soft-clipped bases should be disregarded by the variant calling software, but I'm a bit confused by the existence of the soft-clipped regions in the first place. Surely if the data is amplicon-sequenced, then all reads should have adapters, and once cutadapt has removed them I wouldn't expect any trailing bases that don't match the reference? Does this mean the adapter sequences I pass to cutadapt are incorrect? Or is this a non-issue?

Here's a link to an example BAM track; the top track shows the soft-clipped reads.

alignment soft-clipping amplicon • 3.1k views
6.7 years ago

It depends on how you constructed your sequencing library. If your amplicons vary in size and your sequencing reads are longer than an amplicon, you'll sequence through the amplicon and into the adapter at the 3' end of the read. It sounds like you might be confusing the Illumina adapter sequences with the primers you used to generate your amplicons.
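
To make the read-through arithmetic concrete, here is a minimal sketch. The function name and the example lengths are made up for illustration, and `amplicon_length` is assumed to mean the full insert between the two adapters (primers included):

```python
def adapter_readthrough(read_length: int, amplicon_length: int) -> int:
    """Number of 3' bases of a read that run past the amplicon into the adapter.

    Assumes amplicon_length is the whole insert between the adapters
    (primers included) and that the read starts at the first insert base.
    """
    return max(0, read_length - amplicon_length)


# Hypothetical example: 150 bp reads over amplicons of different sizes.
for amplicon_length in (120, 150, 200):
    n = adapter_readthrough(150, amplicon_length)
    print(f"amplicon {amplicon_length} bp -> {n} adapter bases at the 3' end")
```

Only amplicons shorter than the read length end up with adapter sequence in the read, which is why a mix of amplicon sizes gives a mix of clipped and unclipped reads.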

You're right, I was getting confused. Cutadapt is removing primer sequences, and according to others in the lab we have amplicons of varying size. Thanks for the help.

6.7 years ago

Your primers should match the reference exactly, so on their own they won't give you any soft-clipped bases. However, I do think that you should separately mask your primer sequences, because primer-derived bases reflect the primer rather than the sample and shouldn't be used for variant calling.
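
As a rough illustration of what masking means here, the sketch below replaces read bases that overlap primer intervals with `N`. It is only a toy: `mask_primer_bases` and the coordinates are hypothetical, it assumes an ungapped alignment, and a real pipeline would work on the BAM records (CIGAR included) with a dedicated primer-trimming tool.

```python
def mask_primer_bases(seq, ref_start, primer_intervals, mask_char="N"):
    """Replace read bases that fall inside any primer interval.

    Assumes an ungapped alignment: base i of seq sits at reference
    position ref_start + i (0-based). Intervals are half-open [start, end).
    """
    masked = []
    for offset, base in enumerate(seq):
        ref_pos = ref_start + offset
        in_primer = any(start <= ref_pos < end for start, end in primer_intervals)
        masked.append(mask_char if in_primer else base)
    return "".join(masked)


# Hypothetical amplicon: forward primer at 100-104, reverse primer at 125-129.
primers = [(100, 105), (125, 130)]
read = "ACGTACGTACGTACGTACGTACGTACGTAC"  # 30 bases aligned from position 100
print(mask_primer_bases(read, 100, primers))
```

Masking rather than deleting keeps the read length and alignment coordinates intact, while the variant caller has nothing to call in the primer regions.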

6.7 years ago
ccagg ▴ 60

Also keep in mind that soft clipping can happen when a read spans an insertion or deletion and the aligner is unable to place it properly, so it clips the end of the read instead. If you're planning on calling indels later, especially with software that isn't clipping-aware, this could affect how well you call them.

See Scalpel (2015) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4180789/

and VarDict (2016) https://www.ncbi.nlm.nih.gov/pubmed/27060149
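
If you want to check how much clipping you actually have before worrying about it, the CIGAR strings in the BAM tell you. Below is a minimal sketch in plain Python; `soft_clip_lengths` is a made-up helper, and it takes the CIGAR as text (for example from the sixth column of `samtools view` output).

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clip_lengths(cigar):
    """Return (left, right) soft-clipped base counts for a SAM CIGAR string.

    Left/right are in reference orientation, so for a reverse-strand read
    the left clip corresponds to the 3' end of the original read.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips (H) may sit outside the soft clips, so drop them first.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return (0, 0)
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    return (left, right)


for cigar in ("100M", "90M10S", "3S40M2I50M5S"):
    print(cigar, soft_clip_lengths(cigar))
```

Reads with long clips clustered at the same position are worth a closer look, since that pattern can come from an indel the aligner could not place rather than from leftover adapter.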
