add UMI sequences to fastq read name
5.3 years ago
User6891 ▴ 330

Dear all,

I have paired-end fastq data generated with Illumina bcl2fastq v2.19 and sequenced on a NovaSeq. The i5 index is 7 bp long and the i7 index is 8 bp long.

R1.fastq.gz contains R1 101bp reads:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 1:N:0:AACTGAGG+ATGCGTC
CTGGCCGTCTCAGCCGAGAAGCCGAGGATTGAATGGGCATGGAGACTGAACTACCCCTCTCACCTTTAGAGGTGGCTCCTCCAAGTCGGGGTTGACGCCCG
+
FFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

R2.fastq.gz contains 6bp UMI sequence

@A00154:125:HGKTMDMXX:1:1101:10420:1000 2:N:0:AACTGAGG+ATGCGTC   
GCGCGT
+
FFFFFF

R3.fastq.gz contains R2 101bp reads:

@A00154:125:HGKTMDMXX:1:1101:10420:1000 3:N:0:AACTGAGG+ATGCGTC
CTTCATAGGCCACAAAAAGCCCATATATCAGTGTCATCCACTAAGCCTCAGACACTGCAGCACGGGCAGCGGCAGTGCCAGCTTCGCCCACACTGCCCCTC
+
FFFFFFFFFFFFFFFFFFFFFF:FF:FFF:FFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

In a downstream analysis I want to use UMI-tools for deduplication. For that, however, I need the UMI to be part of the read name, in this form: @Instrument:RunID:FlowCellID:Lane:Tile:X:Y:UMI ReadNum:FilterFlag:0:IndexSequence or SampleNumber

There are tools that add a UMI to the read name when the UMI is present in the read itself. But in my case, the UMI is in a separate fastq file. How could this be achieved?

UMI Illumina fastq • 8.1k views

Looking at the bcl2fastq manual, I have no idea how they made the UMI its own fastq. But bcl2fastq will trim the UMI off the beginning of the read and put it in the read name if the lines

Read1UMILength,6

TrimUMI,1

are in the sample sheet under the [Settings] section.


That's what we tried at first. However, according to Illumina tech support, we couldn't do this because we were sequencing with dual indexes and the UMI was only in the i7. The option you describe only works when you're sequencing with a single index.


I'm also curious, what bases mask did you use for the demultiplexing to get these three fastqs?

5.3 years ago

An awk solution:

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {split($1, id, " "); umi[id[1]]=$2;  next;} {split($1, id, " "); $1=id[1]":"umi[id[1]]" "id[2]; print $0}'  <(zcat R2.fastq.gz|paste - - - -) <(zcat R1.fastq.gz|paste - - - -)|tr "\t" "\n"|bgzip -c > R1_umi.fastq.gz

$ awk -v FS="\t" -v OFS="\t" 'NR==FNR {split($1, id, " "); umi[id[1]]=$2;  next;} {split($1, id, " "); $1=id[1]":"umi[id[1]]" "id[2]; print $0}'  <(zcat R2.fastq.gz|paste - - - -) <(zcat R3.fastq.gz|paste - - - -)|tr "\t" "\n"|bgzip -c > R3_umi.fastq.gz

The fastq.gz files are decompressed by zcat, and the four lines belonging to each read are joined into one tab-delimited line by paste.

While reading the first file (the UMI fastq), awk stores each UMI in an associative array, where the key is the header up to the first whitespace and the value is the UMI sequence.

While reading the second file, awk looks up the UMI for each read, appends it to the id and prints the line. The tabs are then converted back to newlines by tr, and the output is compressed with bgzip.
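
The array approach above keeps every UMI in memory. If the three fastq files list the reads in the same order (which is normally the case for bcl2fastq output, but is an assumption here), the same result can be sketched by reading both files in lockstep. `add_umi` is a hypothetical helper name, not part of any tool; it expects uncompressed fastq input with four lines per record and takes an optional separator as third argument.

```shell
# Sketch: add the UMI from a separate FASTQ to each read name, streaming
# both files in lockstep instead of holding every UMI in memory.
# Assumes uncompressed 4-line FASTQ records in identical read order.
add_umi() {
  umi_fq=$1            # FASTQ holding the UMI reads (R2 in the question)
  read_fq=$2           # FASTQ holding the template reads (R1 or R3)
  sep=${3:-:}          # separator between read ID and UMI, default ":"
  awk -v umifile="$umi_fq" -v sep="$sep" '
    NR % 4 == 1 {
      rec = (NR + 3) / 4
      # Pull the matching 4-line UMI record from the UMI file.
      if ((getline uhead < umifile) <= 0 || (getline useq < umifile) <= 0 ||
          (getline uplus < umifile) <= 0 || (getline uqual < umifile) <= 0) {
        print "UMI file ended early at record " rec > "/dev/stderr"
        exit 1
      }
      split($0, r, " ")
      split(uhead, u, " ")
      # Stop if the two files do not describe the same read.
      if (r[1] != u[1]) {
        print "read names out of sync: " r[1] " vs " u[1] > "/dev/stderr"
        exit 1
      }
      # Rebuild the header: ID, separator, UMI, then the comment field.
      print r[1] sep useq (r[2] != "" ? " " r[2] : "")
      next
    }
    { print }          # sequence, "+" and quality lines pass through
  ' "$read_fq"
}
```

In bash this could be called as `add_umi <(zcat R2.fastq.gz) <(zcat R1.fastq.gz) | bgzip -c > R1_umi.fastq.gz`. Note that umi_tools dedup looks for the UMI after an underscore by default, so either pass `_` as the third argument or run dedup with `--umi-separator=":"`.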

fin swimmer

5.3 years ago
atalbot ▴ 20

If you align R1 and R3 to the genome of your choice, you can annotate the resulting BAM with the UMIs using the fgbio tool AnnotateBamWithUmis: https://fulcrumgenomics.github.io/fgbio/tools/latest/AnnotateBamWithUmis.html. Note that this requires enough memory to hold the entire R2 (UMI) fastq file.

5.3 years ago

Here is what I would do - use UMI-tools and do two passes, one to add the UMI to read1 and one to add the UMI to read2:

umi_tools extract --bc-pattern=NNNNNN --stdin=R2.fastq.gz --read2-in=R1.fastq.gz --stdout=R1.processed.fastq.gz --read2-stdout
umi_tools extract --bc-pattern=NNNNNN --stdin=R2.fastq.gz --read2-in=R3.fastq.gz --stdout=R3.processed.fastq.gz --read2-stdout
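
As far as I know, umi_tools extract appends the UMI to the read ID after an underscore, so a header from the question should come out as `@A00154:125:HGKTMDMXX:1:1101:10420:1000_GCGCGT 1:N:0:AACTGAGG+ATGCGTC`. A quick way to check the processed files is to count how many headers carry such a suffix. The sketch below does that; `count_umi_headers` is a hypothetical helper name and it reads uncompressed fastq from stdin.

```shell
# Sketch: print "<with_umi> <total>" for a FASTQ stream on stdin, where
# with_umi counts read IDs ending in "_" plus exactly len bases.
count_umi_headers() {
  len=${1:-6}          # expected UMI length, 6 bp in the question
  awk -v len="$len" '
    NR % 4 == 1 {
      total++
      split($0, h, " ")               # h[1] is the read ID
      n = split(h[1], part, "_")      # the UMI is the last "_" field
      umi = part[n]
      if (n > 1 && length(umi) == len && umi ~ /^[ACGTN]+$/) ok++
    }
    END { print ok + 0, total + 0 }
  '
}
```

For example, `zcat R1.processed.fastq.gz | head -n 4000 | count_umi_headers 6` checks the first 1000 reads; both numbers should be equal if every read received a UMI.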