MarkDuplicates: Mates are missing
13 months ago
Tezpur University, Assam, India

Hello, I am stuck on a problem with Picard's 'ValidateSamFile'. I have searched for it on several forums but have not found a solution.

I used HISAT2 to align the paired-end FASTQ files (trimmed with Trimmomatic) against an Ensembl reference index:

hisat2 -p 8 --dta --summary-file summary -x 'path/to/Ensembl_ref/indexfile' -1 '/path/to/sample/S1_1p.fastq.gz' -2 '/path/to/sample/S1_2p.fastq.gz' -U '/path/to/sample/S1_1u.fastq.gz' -U '/path/to/sample/S2_2u.fastq.gz' -S S1.sam

After that, I tried to validate the SAM file with Picard's 'ValidateSamFile':

java -jar picard.jar ValidateSamFile I=S.sam IGNORE_WARNINGS=true MODE=VERBOSE

which gave me the following errors:

> [Mon May 13 12:56:42 IST 2019] Executing as genomics@genomics-Precision-3630-Tower on Linux 4.13.0-1028-oem amd64;
> OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12;
> Deflater: Intel; Inflater: Intel; Provider GCS is not available;
> Picard version: 2.20.0-SNAPSHOT
> WARNING 2019-05-13 12:56:42 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
> INFO 2019-05-13 12:57:16 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:34s. Time for last 10,000,000: 34s. Last read position: 8:115,115,834
> INFO 2019-05-13 12:57:57 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:15s. Time for last 10,000,000: 40s. Last read position: 7:43,608,287
> ERROR: Read name S1.916145.1, Mate not found for paired read
> ERROR: Read name S1.916145.2, Mate not found for paired read
> ERROR: Read name S1.9977032.1, Mate not found for paired read
> ERROR: Read name S1.9977032.2, Mate not found for paired read
> ERROR: Read name S1.4916847.1, Mate not found for paired read

Following the Picard guidelines, I ran FixMateInformation to fix the above error, using the following command:

java -jar picard.jar FixMateInformation I=S1.sam O=new_fixed_S1.sam

The error seemed to be fixed:

> [Mon May 13 12:52:03 IST 2019] Executing as genomics@genomics-Precision-3630-Tower on Linux 4.13.0-1028-oem amd64;
> OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12;
> Deflater: Intel; Inflater: Intel; Provider GCS is not available;
> Picard version: 2.20.0-SNAPSHOT
> INFO 2019-05-13 12:52:03 FixMateInformation Sorting input into queryname order.
> INFO 2019-05-13 12:53:34 SortingCollection Creating merging iterator from 43 files
> INFO 2019-05-13 12:53:34 FixMateInformation Sorting by queryname complete.
> INFO 2019-05-13 12:53:34 FixMateInformation Output will be sorted by unsorted
> INFO 2019-05-13 12:53:34 FixMateInformation Traversing query name sorted records and fixing up mate pair information.
> INFO 2019-05-13 12:53:36 FixMateInformation Processed 1,000,000 records. Elapsed time: 00:00:02s. Time for last 1,000,000: 2s. Last read position: */*
> INFO 2019-05-13 12:53:39 FixMateInformation Processed 2,000,000 records. Elapsed time: 00:00:05s. Time for last 1,000,000: 2s. Last read position: 16:173,485
> INFO 2019-05-13 12:53:41 FixMateInformation Processed 3,000,000 records. Elapsed time: 00:00:07s. Time for last 1,000,000: 2s. Last read position: MT:2,103
> INFO 2019-05-13 12:53:44 FixMateInformation Processed 4,000,000 records. Elapsed time: 00:00:10s. Time for last 1,000,000: 2s. Last read position: 2:101,004,226

Next, I revalidated the processed SAM file with ValidateSamFile:

java -jar picard.jar ValidateSamFile I=new_fixed_S1.sam IGNORE_WARNINGS=true MODE=SUMMARY IGNORE=MISSING_TAG_NM

which resulted in:

> ## HISTOGRAM java.lang.String
> Error Type Count
> ERROR:MATE_NOT_FOUND 18412234

This means the error is not actually being fixed. I repeated the whole process, hoping the error would go away after several attempts, but I am simply going around in a loop with no progress.

Finally, I ignored the error and ran Picard's 'MarkDuplicates' with the following command:

java -jar picard.jar MarkDuplicates I=new_fixed_S1.sam O=new_S1.sam M=marked_dup_metrics.txt REMOVE_DUPLICATES=true READ_NAME_REGEX=null

This produced:

> [Mon May 13 13:13:13 IST 2019] Executing as genomics@genomics-Precision-3630-Tower on Linux 4.13.0-1028-oem amd64;
> OpenJDK 64-Bit Server VM 1.8.0_191-8u191-b12-2ubuntu0.16.04.1-b12;
> Deflater: Intel; Inflater: Intel; Provider GCS is not available;
> Picard version: 2.20.0-SNAPSHOT
> INFO 2019-05-13 13:13:13 MarkDuplicates Start of doWork freeMemory: 996413816; totalMemory: 1011351552; maxMemory: 14974713856
> INFO 2019-05-13 13:13:13 MarkDuplicates Reading input file and constructing read end information.
> INFO 2019-05-13 13:13:13 MarkDuplicates Will retain up to 54256209 data points before spilling to disk.

The program has now been running for 2 hours. I don't know what the problem is: whether the SAM file generated by HISAT2 is faulty, my commands are wrong, or I am missing an error-fixing tool.

I followed the related post on Biostars, but it did not give me any clue either. Any help in this regard is deeply appreciated. Thank you.
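
The validation errors above list both `S1.916145.1` and `S1.916145.2` as lacking a mate. One possibility (an assumption, not something confirmed in this thread) is that the two mates of each pair carry different read names that differ only in a trailing `.1`/`.2`, whereas the SAM format expects both mates to share one read name. A minimal sketch for checking this on the SAM text with plain awk follows; `count_mates_per_name` is a hypothetical helper, not a Picard or HISAT2 command. For primary records flagged as paired, it prints how many read names occur once, twice, and so on, optionally after stripping the suffix.

```shell
# Hypothetical helper (not part of Picard or HISAT2).
# Usage:  count_mates_per_name FILE.sam [strip]
# Output: one line per group size: <records per read name> TAB <number of names>
count_mates_per_name() {
  awk -F'\t' -v strip="${2:-}" '
    /^@/ { next }                             # skip SAM header lines
    {
      flag = $2 + 0
      paired        = flag % 2                # bit 0x1: read is paired
      secondary     = int(flag / 256) % 2     # bit 0x100: secondary alignment
      supplementary = int(flag / 2048) % 2    # bit 0x800: supplementary alignment
      if (!paired || secondary || supplementary) next
      name = $1
      # Assumption: mates differ only by a trailing .1 or .2 in the read name.
      if (strip != "") sub(/\.[12]$/, "", name)
      n[name]++
    }
    END {
      for (name in n) hist[n[name]]++
      for (k in hist) printf "%d\t%d\n", k, hist[k]
    }
  ' "$1" | sort -n
}
```

Running `count_mates_per_name S1.sam` and then `count_mates_per_name S1.sam strip` gives two small tables to compare. If nearly every name occurs once in the first and twice in the second, the mates are probably present but named differently, which would be consistent with FixMateInformation not reducing the MATE_NOT_FOUND count.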


Can you show the Trimmomatic command?

java -jar trimmomatic-0.38.jar PE -threads 4 -phred33 S1_1.fastq.gz S1_2.fastq.gz S1_1p.fastq.gz S1_1u.fastq.gz S1_2p.fastq.gz S1_2u.fastq.gz ILLUMINACLIP:TruSeq3-PE.fa:2:30:10:1:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

@ATpoint, any idea how to solve this issue?


What are all these fastq files? You typically have one pair in a paired-end experiment.


Two of the FASTQ files are paired (-1/-2), and the remaining two are unpaired (-U). This is one of my posts.
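
Since both paired (`-1`/`-2`) and unpaired (`-U`) files go into the same alignment, one sanity check is to confirm that the two paired files hold the same number of reads before aligning. A minimal sketch, assuming gzip-compressed FASTQ with the usual four lines per record; `fastq_read_count` is a hypothetical helper introduced here for illustration.

```shell
# Hypothetical helper: number of reads in a gzip-compressed FASTQ file,
# assuming four lines per record (no wrapped sequence or quality lines).
# Usage: fastq_read_count S1_1p.fastq.gz
fastq_read_count() {
  gzip -dc "$1" | awk 'END { print NR / 4 }'
}
```

Calling it on `S1_1p.fastq.gz` and `S1_2p.fastq.gz` should print the same number if the paired files are in sync; the two unpaired files may differ from each other.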
