I would like to deduplicate reads using UMIs. I have paired-end data where read 1 contains the mappable sequence, while read 2 only contains a 12 bp UMI sequence. Is it possible to extract the UMIs from read 2 and add them to the header of read 1 (for deduplication after mapping of read 1)? UMI-tools seems to be great for UMI extraction and deduplication, but I didn't find the option of transferring the extracted UMI from read 2 to the header of read 1. Is there a work-around within UMI-tools or are there alternative tools to conveniently achieve this?
Thanks for the quick answer. So, the UMIs from read 2 would then be added to the headers of both read 1 and read 2? Is that the default behavior of `UMI-tools extract` for paired-end reads with a UMI in just one of the reads?

It'll be on the read headers of both reads. For some historic reasons, `-S` is short for standard out. The input from `-I`, with the UMI trimmed and added to the read header, is output there. Of course, because the read contains only the UMI, the reads will actually be empty. The output from `--read2-out` will be the reads from `--read2-in` with the same UMI attached to the header. This is the default behavior of `extract` with paired data.

No, only to read 1, based on `--read2-out` being the only output. You won't have a need for read 2 at that point since it only contains the UMI?

Yes, I won't need read 2 after UMI extraction. I'm just not sure if I understand where the extracted UMIs will end up. With
`-I read2.fastq.gz -S read2_processed.fastq.gz --bc-pattern=NNNNNNNNNNNN` the UMIs should be added to the read headers in `read2_processed.fastq.gz`, right? But will the same UMIs be added to the corresponding read headers in `read1_processed.fastq.gz`, too?

Considering @Ian Sudbery is the author of `umi-tools`, his advice is going to be accurate. Why don't you try the command out and see? I don't have an install of `umi-tools` to see what the `-S` option does.

Good point! I'm still in the library preparation process, but already want to figure out how to analyze the anticipated data. So, I created a small toy data set and Ian's suggested command works beautifully. The UMI tags are added to the read headers of both read 1 and read 2. Thank you so much for your answers and suggestions.
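
For anyone who wants to see the idea without installing anything, here is a rough Python sketch of what the thread describes: the first 12 bases of each UMI read are removed and appended to the headers of both mates. This is not UMI-tools' own code. The function names (`parse_fastq`, `tag_header`, `extract_umis`) and the toy reads are made up for illustration, and the `_UMI` suffix on the read ID is my assumption about the header convention.

```python
def parse_fastq(text):
    """Split FASTQ text into (header, sequence, plus, quality) records."""
    lines = text.strip().splitlines()
    return [tuple(lines[i:i + 4]) for i in range(0, len(lines), 4)]


def tag_header(header, umi):
    """Append _UMI to the read ID (the part before the first space).

    Assumption: this mirrors the header convention described in the thread.
    """
    read_id, sep, rest = header.partition(" ")
    return f"{read_id}_{umi}{sep}{rest}"


def extract_umis(umi_reads, mapped_reads, umi_len=12):
    """Move the first umi_len bases of each UMI read into both headers."""
    umi_out, mapped_out = [], []
    for (h2, s2, p2, q2), (h1, s1, p1, q1) in zip(umi_reads, mapped_reads):
        umi = s2[:umi_len]
        # The UMI read is trimmed, so a UMI-only read ends up empty.
        umi_out.append((tag_header(h2, umi), s2[umi_len:], p2, q2[umi_len:]))
        # The mappable read keeps its sequence and only gains the UMI tag.
        mapped_out.append((tag_header(h1, umi), s1, p1, q1))
    return umi_out, mapped_out


# Toy data: read 2 holds only the 12 bp UMI, read 1 holds the mappable sequence.
read2 = parse_fastq(
    "@r1 2:N:0\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    "@r2 2:N:0\nTTTTGGGGCCCC\n+\nIIIIIIIIIIII\n"
)
read1 = parse_fastq(
    "@r1 1:N:0\nGATTACAGATTACA\n+\nIIIIIIIIIIIIII\n"
    "@r2 1:N:0\nCCCATTTGGGAAAC\n+\nIIIIIIIIIIIIII\n"
)

read2_processed, read1_processed = extract_umis(read2, read1)
for header, sequence, _, _ in read1_processed:
    print(header, sequence)
```

With real data you would of course let UMI-tools do this; the sketch only shows why the processed read 2 file comes out with empty sequences while read 1 keeps its full sequence plus the UMI in the header.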