Question

Solve SAM issues flagged by Picard's ValidateSamFile

0

Entering edit mode

5.7 years ago

marongiu.luigi ▴ 710

Dear all,

I aligned fastq pairs to the human genome with BWA MEM, including the read group flag. The fastq files were trimmed for quality with

trimmomatic.sh PE -phred33 -threads 8 <mate1.fq.gz> <mate2.fq.gz> \
<mate1_trimPaired.fq.gz> <mate1_trimUnpaired.fq.gz> \
<mate2_trimPaired.fq.gz> <mate2_trimUnpaired.fq.gz> \
LEADING:13 TRAILING:13 SLIDINGWINDOW:7:30 MINLEN:30

Since each sample had multiple files, derived from the same libraries and lanes, I pooled them with

cat <mate1_trimPaired.fq.gz> ... > <mate1.fq.gz>

then aligned and converted with

bwa mem -t 8 -R <read.group> <ref_file> <mate1.fq.gz> <mate2.fq.gz> > aln.sam
samtools view -Sb <aln.sam > > <aln.bam>

and eventually ran

java -jar picard.jar ValidateSamFile \
INPUT=<aln.bam> MODE=SUMMARY > <report.txt>

The different samples show the same errors:

ERROR:MATES_ARE_SAME_END
ERROR:MATE_NOT_FOUND
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND
ERROR:MISMATCH_FLAG_MATE_UNMAPPED
ERROR:MISMATCH_MATE_ALIGNMENT_START
ERROR:MISMATCH_MATE_REF_INDEX

The problem is now, as you might expect, how to solve these errors? I would be happy even to simply drop the bad reads.

Thank you

Picard SAM • 4.2k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 5.7 years ago by marongiu.luigi ▴ 710

3

Entering edit mode

You likely concatenated the wrong files or in the wrong order. Concatenate the original files before you do any trimming. As an aside, if you're going to use a local aligner like bwa mem you can simply skip trimming.

ADD REPLY • link 5.7 years ago by Devon Ryan 104k

0

Entering edit mode

OK, I'll skip trimming. For order you mean lane 1 after lane 2 etc?

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

0

Entering edit mode

Lane order should not matter as long as you used the same order for the R1/R2 files (assuming data is only from one flowcell). e.g.

cat Sample_L002_R1.fq Sample_L001_R1.fq > Sample_fnl_R1.fq
cat Sample_L002_R2.fq Sample_L001_R2.fq > Sample_fnl_R2.fq

BTW: Don't use word mate unless you have the right kind of libraries. Mate-pair libraries mean something different compared to regular libraries sequenced as paired-end.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

I have aligned the concatenated non-trimmed files, but I got the same errors:

## HISTOGRAM    java.lang.String
Error Type  Count
ERROR:MATES_ARE_SAME_END    2636751
ERROR:MATE_NOT_FOUND    4997955
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 3218552
ERROR:MISMATCH_FLAG_MATE_UNMAPPED   150004
ERROR:MISMATCH_MATE_ALIGNMENT_START 5215849
ERROR:MISMATCH_MATE_REF_INDEX   3477327

So the question still stand: is there a way to remove these flags?

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

0

Entering edit mode

It is possible that your original files could be out of sync as far as reads are concerned. Have you examined few entries at top of both files to ensure that the fastq headers are matching? Run them through repair.sh from BBTools to see if you can get them back in sync if they are not.
As I recall one needs to reverse complement one of the reads before alignment, if these are really mate-pair libraries. Have you looked into this?

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

Now that you mention it, they don't look quite the same to me. Here are some of the lines I got from the samples:

$ zcat N-1/N-1_1.fq.gz | head
@HISEQ:90:C3UNJACXX:1:1101:1222:2116 1:N:0:
NCTTCATTGGAGTGGTGACCAATCGTCTTTGCAATCCAGCTGCAACTTCCACTGTCATCTGAAATGGTGTGGATCTGTGTCCTTACCCAAAACATATGTCG
+
#1=DFFFFHHGHFHIGHIJJJIJJJIJJJJIJJJJJJIGJJJJJGJIJJJFHFFGHHJJIJJJJGGICEEHIEFHACCCDDFEFCDCCEBBDDDACDDBDB
@HISEQ:90:C3UNJACXX:1:1101:1201:2121 1:N:0:
AATGTCTTCTTTTGAGAAGTATTTGTGGAGTTTTGGAAAGTTGCAGACTAGCAGAAACTTCATAGCAAAGAGCATAAAAGTCATCAGTCAGTGACTAGGTA
+
CCCFFFFFHHHHHJIJHIIFHIJJJHHJHIHHIJIJJJJJHHIIIGJJJJJJJJJJJJIJJJJJJIJJJIJJJIHHHHHH@DFFFFEEEEEEEDDDDDCCC
@HISEQ:90:C3UNJACXX:1:1101:1246:2124 1:N:0:
AATTATTAAGAAATGACAGTGGTGTATACCCATTCTGAAATGTTTTTTGGAGAAGAGATGTGTAGGATCAGGTAGGAGAAATTAATAGTTTACTGGTTCAA

$ zcat N-2/N-2_1.fq.gz | head
@DHGDKXP1:247:C3MF6ACXX:2:1101:1187:1896 1:N:0:
NGCCGATATCACGCCACTGCACTCCAGTCNGGGAGGTTGCACTGAGCTGAGATCATGCCACTGCACTCCAGCCTGGGCAACAGAGTGANANTCTGTCTCAA
+
#1:DD7DDAHH>BEF?FHGHHIGIE>BDF#10)))8?D;FHIICDDF@3)=CCA>C;DEHFHHE@BDE@@CCCC2=?=5?@9@CCCCC#+#++8@BC>C43
@DHGDKXP1:247:C3MF6ACXX:2:1101:1228:1898 1:N:0:
NGGAGTGGATGAGGCTTCCAGGGAAGCCCNAGTCCCTGAGTGTGGGCTCGCGACTGTCTCACTGGTTAGCTTGTTGACAGTCACTCCANCNCTGCCAGACA
+
#1=BDDDDHHFHFEHHCFBH<B<FEH=BG#)1:?DDFE>GHHAE;@B@=FGI'9<AAACC;A@@@>C595:@CBC953>C>:>>@A?C#+#++28(+:A??
@DHGDKXP1:247:C3MF6ACXX:2:1101:1204:1908 1:N:0:
NATGCTCCACAAAGTCAGAGACACTTAGTNTCATTGTCACATTTTCAAAGCCAAGAACCTATTATGTCTGGGGTATAGTATGCATTTANTNTTGTTTGATT

$ zcat N-3/N-3_1.fq.gz | head
@HISEQ:90:C3UNJACXX:3:1101:1125:2095 1:N:0:
NGCTGTGGCTGAGCTCCCACACAGAGGCAGTGAGTACCGACTTGCCAACTGCGTGTTAGCTAGCCTGGAGGTGGACCTTCCAGCCACTGAATGTCCCTGTG
+
#1=DFDDFHHHHHJJJIJJJIIJIIJJIIJFHGI<BHIIIJJJJJJJJIIHGIAGHEHHHFFFFFFEDEED?BDDCDDCDDDDDDDDCCCCCCCDEDCD:A
@HISEQ:90:C3UNJACXX:3:1101:1226:2105 1:N:0:
NTGGCAGGGTTCTGGCAGCTGCTGTGCCCGAGGAGGGCTCCAATGCAGAGGAGTAGCACCAGGGCCTGCATCCTGGGGCCTGCAAGAAAAGAACGAGGCAG
+
#1+AD7DDFA?FFEFFFIIEA@FEFFIIIEIIIBEFFFFFFFFFIIIAFFEFF=?@BDD?;?CCBBBBBBA>::>>8=<>?@BBBBBBBBBBBBB>>BBB?
@HISEQ:90:C3UNJACXX:3:1101:1110:2109 1:N:0:
NGCCTGCTCCCACAAGAGAGGAGCAACTGGCCGTTCCAGCAATCGTTTGTTCCAGTCTCTACAATTGCTTGCTTGCATGTTCTCTCTCGTGAGGAGTGGCC

I was told these where from the same patient and made with the same library. Could it be an issue of lane? Shall I merge only the reads from the same run but in different lanes? Can I merge all reads form the same patient made on different runs but with the same library? Shall I align each lane separately and then somehow merge the alignments? What is the best approach?

ADD REPLY • link updated 5.7 years ago by GenoMax 141k • written 5.7 years ago by marongiu.luigi ▴ 710

0

Entering edit mode

I was referring to checking that order of reads in N-1_1.fq.gz is the same as in N-1_2.fq.gz. So in such as case the first fastq header from R1 file should be

@HISEQ:90:C3UNJACXX:1:1101:1222:2116 1:N:0:

matching with first one in R2 file (Note 2 before :N:0:)

@HISEQ:90:C3UNJACXX:1:1101:1222:2116 2:N:0:

You are looking at R1 files from three lanes(?) above. They will be different for that reason alone.

If this is the same sample run on three lanes then you can easily do cat N-1_1.fq.gz N-2_1.fq.gz N-3_1.fq.gz > final_1.fq.gz (and similar for R2 files). You do need to make sure that the order of reads in all three files is identical.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

In that case, they look synchronized:

$ zcat N-1/N-1_1.fq.gz | head
@HISEQ:90:C3UNJACXX:1:1101:1222:2116 1:N:0:
@HISEQ:90:C3UNJACXX:1:1101:1201:2121 1:N:0:
@HISEQ:90:C3UNJACXX:1:1101:1246:2124 1:N:0:
$ zcat N-1/N-1_2.fq.gz | head
@HISEQ:90:C3UNJACXX:1:1101:1222:2116 2:N:0:
@HISEQ:90:C3UNJACXX:1:1101:1201:2121 2:N:0:
@HISEQ:90:C3UNJACXX:1:1101:1246:2124 2:N:0:

$ zcat N-2/N-2_1.fq.gz | head
@DHGDKXP1:247:C3MF6ACXX:2:1101:1187:1896 1:N:0:
@DHGDKXP1:247:C3MF6ACXX:2:1101:1228:1898 1:N:0:
@DHGDKXP1:247:C3MF6ACXX:2:1101:1204:1908 1:N:0:
$ zcat N-2/N-2_2.fq.gz | head
@DHGDKXP1:247:C3MF6ACXX:2:1101:1187:1896 2:N:0:
@DHGDKXP1:247:C3MF6ACXX:2:1101:1228:1898 2:N:0:
@DHGDKXP1:247:C3MF6ACXX:2:1101:1204:1908 2:N:0:

$ zcat N-3/N-3_1.fq.gz | head
@HISEQ:90:C3UNJACXX:3:1101:1125:2095 1:N:0:
@HISEQ:90:C3UNJACXX:3:1101:1226:2105 1:N:0:
@HISEQ:90:C3UNJACXX:3:1101:1110:2109 1:N:0:
$ zcat N-3/N-3_2.fq.gz | head
@HISEQ:90:C3UNJACXX:3:1101:1125:2095 2:N:0:
@HISEQ:90:C3UNJACXX:3:1101:1226:2105 2:N:0:
@HISEQ:90:C3UNJACXX:3:1101:1110:2109 2:N:0:

the cat N-1_1.fq.gz N-2_1.fq.gz etc you indicate is exacty what I have done.

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

0

Entering edit mode

In that case only thing left is to check on is mate-pair library processing requirements. If your sample data is truly mate-pair and was done using an Illumina kit then check out the data-processing requirements in this Illumina white paper.

ADD REPLY • link 5.7 years ago by GenoMax 141k

0

Entering edit mode

If you're really using the commands as described above and not mucking up the order in which you concatenate files then I haven't a clue why picard is complaining so much. Having said that, picard complains about a lot of perfectly valid things.

ADD REPLY • link 5.7 years ago by Devon Ryan 104k

1

Entering edit mode

Note that you can shorten, simplify, also sort the alignments and avoid intermediate files in the alignment step by replacing

bwa mem -t 8 -R <read.group> <ref_file> <mate1.fq.gz> <mate2.fq.gz> > aln.sam
samtools view -Sb <aln.sam > > <aln.bam>

with

bwa mem -t 8 -R <read.group> <ref_file> <mate1.fq.gz> <mate2.fq.gz> | samtools sort -o <aln.bam> -

ADD REPLY • link updated 5.7 years ago by finswimmer 16k • written 5.7 years ago by WouterDeCoster 47k

0

Entering edit mode

yes, I know but I prefer to keep the original unsorted BAM just in case...

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

1

Entering edit mode

just in case...

Just in case of what?
You still have the fastq as a back-up, and your sorted bam contains the same information as unsorted but takes less hard drive space.

ADD REPLY • link 5.7 years ago by WouterDeCoster 47k

0

Entering edit mode

from this post, I found that to recreate the fastq files I need the non sorted BAMs, thus I now keep the unsorted 'just in case' I need to re-generate fastq for subsets of the alignment

ADD REPLY • link 5.7 years ago by marongiu.luigi ▴ 710

4

Entering edit mode

Keep the fastq files instead, they take up less space and you'll need to submit them to ENA or a similar site if you ever want to publish this.

ADD REPLY • link 5.7 years ago by Devon Ryan 104k

0

Entering edit mode

You can always sort a BAM by name to restore the initial fastq order.

ADD REPLY • link 5.6 years ago by ATpoint 81k

0

Entering edit mode

I have found that the main cause of error is having not sorted files. I tested some sorted (srt) and not sorted files in pairs and the errors come out only with the latter (absence of flags means no errors):

$ 501N_alnHUM.bam
Error Type  Count
ERROR:MATES_ARE_SAME_END    1578650
ERROR:MATE_NOT_FOUND    2687070
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 1766042
ERROR:MISMATCH_FLAG_MATE_UNMAPPED   95516
ERROR:MISMATCH_MATE_ALIGNMENT_START 3149591
ERROR:MISMATCH_MATE_REF_INDEX   2327647
$ 501N_alnHUMsrt.bam

$ 501T_alnHUM.bam
ERROR:MATES_ARE_SAME_END    1368397
ERROR:MATE_NOT_FOUND    2348094
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 1576968
ERROR:MISMATCH_FLAG_MATE_UNMAPPED   77894
ERROR:MISMATCH_MATE_ALIGNMENT_START 2688537
ERROR:MISMATCH_MATE_REF_INDEX   1861112
$ 501T_alnHUMsrt.bam

$ 502N_alnHUM.bam
ERROR:MATES_ARE_SAME_END    755973
ERROR:MATE_NOT_FOUND    1270972
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 862516
ERROR:MISMATCH_FLAG_MATE_UNMAPPED   51766
ERROR:MISMATCH_MATE_ALIGNMENT_START 1511837
ERROR:MISMATCH_MATE_REF_INDEX   1094829
$ 502N_alnHUMsrt.bam

WouterDeCoster is therefore very right when saying that there is no need for intermediate files.

ADD REPLY • link 5.6 years ago by marongiu.luigi ▴ 710

0

Entering edit mode

Just in case should any errors remain after the sorting of the file, I imagine it should be possible to filter the wrong reads using the SAM files. For instance, I have a file (unsorted) with the error:

MISMATCH_FLAG_MATE_NEG_STRAND

thus I used the [flag sheme][1] to identify the flag 41 for read paired (0x1) + mate unmapped (0x8) + mate reverse strand (0x20) and used the command samtools view -h -f 41 <file>.bam -o <file_subset>.bam to remove unwanted reads. The resulting file, though, contains only the header and in fact the size of the file passes from 31 GB to 640 bytes (and only the header shows up with samtools view -h).

What would be the right procedure to remove the unwanted reads?

ADD REPLY • link 5.6 years ago by marongiu.luigi ▴ 710

0

Entering edit mode

It turned out, this was not just a theoretical aspect: after sorting the file, the error persisted: ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 2566136. Interestingly, this error raised after aligning with HISAT2 but not with BWA; a false positive of HISAT or a false negative of BWA?