How to deal with reads which CIGAR is [0-9]+S
1
0
Entering edit mode
6 months ago
ManuelDB ▴ 80

I am developing an NGS Bioinformatics pipeline to analyze somatic amplicon data. After mapping my reads, I soft-clip both ends of my reads because these reagions are the primers and I dont want them to be used for the variant calling. I soft-clip these regions with this simple code

 apptainer exec "$samtools" samtools  ampliconclip \
               --strand \
               --soft-clip \
               --fail \
               -b "$primercoordinatesbed" \
               "${Sample_ID}"_mapped_sorted_IRealigned.bam > "${Sample_ID}"_CLIPPED.bam

After analyzing many different samples, I have realised that occasionally, a small number of reads show a soft clip that covers the entire read

 MN01972:56:000H5KYM3:1:11102:21686:4925 256     chrX    124025823       0       150S    *       0       0       ATATAAATTTCTAAGGAAGGGTTTTAATTCGCGAATTTGACGATCGTTGCATTAACTCGCGAACCCAGCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAATGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG  AFFFFFFFFFFF6FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF/FFFFFFF66//FFF/FFAAFFF=A/FFAFFF/FF/FFF////AFFAFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  SA:Z:chr2,32916355,+,104S46M,0,0;       RG:Z:230926_MN01972_0056_A000H5KYM3_RACP2-6poolv6-P5-A  UQ:i:0  AS:i:30

(I believe this happens because the reads came with a large soft clip in the middle and the two additional soft clips finally completely softcliped the entire reads)

This is very unlikely BUT This returned an error in PISCES (the variant calling I use) and to solve this issue I developed this code to unmapped those reads

 apptainer exec "$samtools" samtools view -h "${Sample_ID}"_CLIPPED.bam \
| awk 'BEGIN {OFS="\t"; total=0; changed=0} $1 ~ /^@/ {print; next} {total++} $6 !~ /^[0-9]+S$/ {print} $6 ~ /^[0-9]+S$/ {$2 = 512; $5 = 0; changed++; print} END {print "Total reads: " total > "/dev/stderr"; print "Changed: " changed > "/dev/stderr"}' \
| apptainer exec "$samtools" samtools view -bS - > "${Sample_ID}"_FINAL.bam

This overcome the issue BUT The problem comes here.

I am improving my code and I have added ValidateSamFile before the PISCES to check my BAM file. This is more pedantic than PISCES and I get so far the following error

 ERROR::INVALID_CIGAR:Record 1285596, Read name MN01972:56:000H5KYM3:1:11102:21686:4925, No real operator (M|I|D|N) in CIGAR

Is there a proper way to deal with these types of errors? I am wondering if I could use PYSAM instead of that horrible awk command to somehow null this problematic reads so that I can pass the validation checking without adding IGNORE=INVALID_CIGAR in the ValidateSamFile

I never use PYSAM before. Any advice in any direction will be welcome.

NOTE: To delete those ^[0-9]+S reads is not a solution as my supervisor doesn't like this approach. I need something alternative to this.

CIGAR ampliconclip • 542 views
ADD COMMENT
0
Entering edit mode

Maybe you can trim read ends with fixed length in fastq file before alignment?

ADD REPLY
1
Entering edit mode
6 months ago

using samjdk ( https://jvarkit.readthedocs.io/en/latest/SamJdk/ ) and the expression below, you can set the unmapped flag if all the cigar is clipped:

 java -jar jvarkit.jar samjdk -e 'if(record.getReadUnmappedFlag()) return true;final Cigar cigar = record.getCigar(); if(cigar.getCigarElements().stream().map(C->C.getOperator()).allMatch(OP->OP.isClipping())) {record.setReadUnmappedFlag(true);record.setProperPairFlag(false);} return record;' 2> /dev/null
ADD COMMENT
0
Entering edit mode

I will check this when I get permission from my manager. But this looks very promising.

ADD REPLY

Login before adding your answer.

Traffic: 1766 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6