How to edit a fq.gz file
3 months ago

Hi everyone:

I would like to delete the last 15 letters from the 1st line of each read:

@A00261:889:HHW5HDSX7:3:1101:2528:1016 1:N:0:GCCAATATCT+AGATCTCGGT
GNTTGAATTCAATGTGAGCAGAAGCAAGCCAGATAAAACACAAACAGTAAATTAAGCTAAGTTCTGAAGAGCTTGGCTTCTGCCAATAGAGCAAACAACCTGGGCTATGTTAAATTCGCCTCTGGCGGCTTCAGTCTACATTACTAGAGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF

@A00261:889:HHW5HDSX7:3:1101:3884:1016 1:N:0:GCCAATATCT+AGATCTCGGT
GNTTGAATTCAATGCAGAATGCTGAGAGCTTGCCACTTCTGGCAATTAACTTGAGATAAACAAAAGTGGTAAGAGGAGGCATTAAGTACCCACCTGCAGAAGACTGGCTCAGTGCTGAGTGCTCTCAACAGATGAGTGCTAATTGCAATG
+
F#FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:F:FFFFFFF

Any help would be appreciated.

fastq • 1.3k views

This sounds a lot like an XY problem (https://xyproblem.info/). Can you please explain what problem you're trying to solve? (more bases than quality string, removing library indices from read names, etc)


Are you planning on trimming the read name? You should provide an example.


Hi:

I want to edit the header line. I am looking for something like this:

original:

A00261:889:HHW5HDSX7:3:1101:2528:1016 1:N:0:GCCAAT**ATCT+AGATCTCGGT**
GNTTGAATTCAATGTGAGCAGAAGCAAGCCAGATAAAACACAAACAGTAAATTAAGCTAAGTTCTGAAGAGCTTGGCTTCTGCCAATAGAGCAAACAACCTGGGCTATGTTAAATTCGCCTCTGGCGGCTTCAGTCTACATTACTAGAGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF

result:

A00261:889:HHW5HDSX7:3:1101:2528:1016 1:N:0:**GCCAAT**
GNTTGAATTCAATGTGAGCAGAAGCAAGCCAGATAAAACACAAACAGTAAATTAAGCTAAGTTCTGAAGAGCTTGGCTTCTGCCAATAGAGCAAACAACCTGGGCTATGTTAAATTCGCCTCTGGCGGCTTCAGTCTACATTACTAGAGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF

This requirement makes no sense. Why do you need to trim the header by position?


To trim the last 15 characters from every fourth line, starting with the first, you can do something as simple as:

awk 'NR % 4 == 1 {sub(/.{15}$/, "")} {print}' test.fq 
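
If your awk does not accept the {15} interval syntax in regular expressions (some older or minimal implementations reportedly don't), the same header trim can be written with substr() instead. This is only a minimal sketch, assuming strict 4-line FASTQ records; the function name trim_header_tail and its argument are made up for illustration.

```shell
# Drop the last N characters from every FASTQ header line (lines 1, 5, 9, ...).
# Assumes 4-line records; trim_header_tail is a hypothetical helper name.
trim_header_tail() {
    awk -v n="$1" '
        NR % 4 == 1 { $0 = substr($0, 1, length($0) - n) }  # header lines only
        { print }                                            # all lines are printed
    '
}
```

It reads from stdin and writes to stdout, so a call would look something like: trim_header_tail 15 < test.fq > trimmed.fq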
3 months ago
Ram 43k

Adding this as an answer since it technically answers OP's question:

$ cat sample.fq
@SEQ_ID789012345678901234567890
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

$ cat sample.fq | seqkit replace -p '.{15}$'
@SEQ_ID789012345
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

The following works as well (using OP's example to make it specific to their use case):

$ sed '/^@A00261/s/.\{15\}$//' test.fq
@A00261:889:HHW5HDSX7:3:1101:2528:1016 1:N:0:GCCAAT
GNTTGAATTCAATGTGAGCAGAAGCAAGCCAGATAAAACACAAACAGTAAATTAAGCTAAGTTCTGAAGAGCTTGGCTTCTGCCAATAGAGCAAACAACCTGGGCTATGTTAAATTCGCCTCTGGCGGCTTCAGTCTACATTACTAGAGG
+
F#FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFF
@A00261:889:HHW5HDSX7:3:1101:3884:1016 1:N:0:GCCAAT
GNTTGAATTCAATGCAGAATGCTGAGAGCTTGCCACTTCTGGCAATTAACTTGAGATAAACAAAAGTGGTAAGAGGAGGCATTAAGTACCCACCTGCAGAAGACTGGCTCAGTGCTGAGTGCTCTCAACAGATGAGTGCTAATTGCAATG
+
F#FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,:F:FFFFFFF

True, seqkit reads gzipped files and writes gzipped output as well. Not saying we can't do it with a simple zcat | sed '1~4s/..' | gzip >, but seqkit wraps the nuts and bolts well.

seqkit replace -p '.{15}$' in.fastq.gz -o out.fastq.gz
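
For anyone without seqkit installed, the decompress / edit / recompress route mentioned above can be spelled out with standard tools. This is a rough sketch, not a tested pipeline for real data: it assumes 4-line FASTQ records, uses awk with NR % 4 rather than sed's 1~4 address (which is a GNU sed extension), and the function and argument names are hypothetical.

```shell
# Trim the last N characters of each header in a gzipped FASTQ and re-compress.
# Assumes 4-line records; function and argument names are hypothetical.
trim_fastq_gz_headers() {
    in_gz=$1    # input .fq.gz / .fastq.gz
    out_gz=$2   # output path (gzip-compressed)
    n=$3        # number of trailing header characters to drop
    gzip -dc "$in_gz" |
        awk -v n="$n" 'NR % 4 == 1 { $0 = substr($0, 1, length($0) - n) } { print }' |
        gzip > "$out_gz"
}
```

A call would look something like: trim_fastq_gz_headers in.fastq.gz out.fastq.gz 15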
3 months ago
GenoMax 141k

I would like to delete the last 15 letters from the 1st line of each read

If this means trimming/shortening the sequences by 15 bases, then you should be able to do this with bbduk.sh: look at the forcetrimright=N option and replace N with (your read length - 16), since the position is 0-based and only bases to the right of it are removed. Other trimming programs should have similar options.

forcetrimright=0    (ftr) If positive, trim bases to the right of this position
                    (exclusive, 0-based)
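
Reading the help text above literally (0-based position, bases to the right of it removed), keeping the first L - 15 bases of a read of length L means the last kept position is L - 16, which is one less than a plain L - 15. The arithmetic can be sketched as below; ftr_for_trim is a hypothetical helper, and the result is worth checking on a handful of reads before running the full data set.

```shell
# Compute the forcetrimright (ftr) value that removes n_trim bases from the
# right end of reads of length read_len, taking the 0-based help text literally.
# ftr_for_trim is a hypothetical helper name, not part of BBTools.
ftr_for_trim() {
    read_len=$1   # original read length
    n_trim=$2     # number of bases to remove from the right end
    echo $(( read_len - n_trim - 1 ))   # last 0-based position to keep
}
```

For example, with 150 bp reads, ftr_for_trim 150 15 gives 134, i.e. positions 0 through 134 are kept (135 bases).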

I think OP wants to partially (?) remove the I1/I2 barcoding content. They need the header edited, not the sequence.
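
If that is indeed the goal, anchoring on the index field itself would be less fragile than counting 15 characters from the end, since it keeps working when index lengths differ. Below is a minimal sketch that keeps only the first K characters of the last ':'-separated field of each header. It assumes 4-line records and Illumina-style headers where the index is the last field; that OP wants the first 6 bases is a guess based on their "result" example, and keep_index_prefix is a made-up name.

```shell
# Keep only the first K characters of the last ":"-separated header field
# (the index field in Illumina-style headers). Assumes 4-line FASTQ records.
# keep_index_prefix is a hypothetical helper name.
keep_index_prefix() {
    awk -v k="$1" '
        NR % 4 == 1 {
            n = split($0, parts, ":")          # split header on ":"
            parts[n] = substr(parts[n], 1, k)  # truncate the last field
            line = parts[1]
            for (i = 2; i <= n; i++) line = line ":" parts[i]
            $0 = line                          # rebuilt header
        }
        { print }
    '
}
```

A call would look something like: keep_index_prefix 6 < test.fq > edited.fq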


Editing a FASTQ header like that does not make sense, but then OP may be the only person who can clarify their requirement.


You're right. Like LChart said, this looks like an XY problem.

3 months ago
size_t ▴ 120

Try this tool:

fqkit trim -r 15 your.fq.gz

Interesting tool. Did you do any benchmarking against existing tools such as seqtk and seqkit?

BTW, trim has a stub as far as documentation goes. It trims sequences and not headers, right? If so, the command is not relevant to OP's use case.


I don't know why I can't add a reply, @Ram.

"It trims sequences and not headers, right?" Yes, this command only trims sequences, not headers. My bad.


seqkit has a tool to trim headers, right?


Yeah, seqkit replace can be used with a '.{15}$' regex. I still think this is a really weird thing to do.
