Simple FASTQ/A manipulation... how to add a single adapter sequence to 5' of all reads?
5.1 years ago
quickquark • 0

Hi everyone, and thanks in advance! I'm used to doing lots of trimming, substituting, etc on large FASTQ/A files, but now I need to add sequence arbitrarily at the beginning of all reads and I'm coming up short! Been searching a couple hours for a method via toolkit (fastx_toolkit, BBmap, etc.) or simple command (sed, awk, etc.).

So I'm looking to go from something like this:

>header
GTCTCAGATCGGAAGAGCACACGT
>header
CCGGTCCTGGTTGCAGATCGGAAG
>header
GTATCTCCTAAGATATAACAGGTTG
>header
AGGTACAGGTTGGATGATAAGTCC

to this:

>header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
>header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
>header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
>header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

Alternatively, I can do the same with FASTQ files (also extending the quality lines to match), if there's already a tool out there for that. I'm not interested in quality at this point, as I've already merged paired-end reads with PandaSeq and filtered out anything but the highest quality reads.

FASTA FASTQ • 2.2k views

While you have been given possible solutions below, you would be breaking FASTQ format if you do not add corresponding scores on the quality line. The example you showed above is neither valid FASTA nor valid FASTQ format.
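
To illustrate the point about the quality line, here is a minimal sketch for a strict four-line-per-record FASTQ file that prepends the adapter to each read and pads the quality string by the same length. The helper name add_adapter_fastq and the placeholder quality character "I" (Phred 40 under Phred+33 encoding) are assumptions for this example, not something from the thread; pick a quality value that suits your downstream tools.

```shell
# Prepend an adapter to every read of a 4-line-per-record FASTQ stream and
# pad the quality line by the same length, so sequence and quality lines
# stay equal in length.
# add_adapter_fastq is a hypothetical helper name; "I" is an assumed
# placeholder quality character (Phred 40 in Phred+33 encoding).
add_adapter_fastq() {
    adapter=$1
    qual_char=${2:-I}
    awk -v adapter="$adapter" -v q="$qual_char" '
        BEGIN {
            # Build a quality pad with one character per adapter base.
            for (i = 0; i < length(adapter); i++) pad = pad q
        }
        NR % 4 == 2 { $0 = adapter $0 }   # sequence line of each record
        NR % 4 == 0 { $0 = pad $0 }       # quality line of each record
        { print }
    '
}

# Example: one record read from standard input.
printf '@read1\nGTCT\n+\nFFFF\n' | add_adapter_fastq AAAAAA
```

Because the lines are selected by position rather than by their first character, quality strings that happen to start with @ or > are handled like any other. The sketch assumes sequences and qualities are not wrapped across several lines.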

Ah yes, sorry, I should have been more accurate with that in case others come across this. I'll edit it to look like a real FASTA.

quickquark : Please test @Pierre's solution. It should work and if it does you should accept that too. You can accept more than one answer if they work.

5.1 years ago

sed will do that:

$ sed 's|^\([^@>].*\)|AAAAAA\1|' fastq.fq
@header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
@header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
@header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
@header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

$ sed 's|^\([^@>].*\)|AAAAAA\1|' fasta.fa
>header
AAAAAAGTCTCAGATCGGAAGAGCACACGT
>header
AAAAAACCGGTCCTGGTTGCAGATCGGAAG
>header
AAAAAAGTATCTCCTAAGATATAACAGGTTG
>header
AAAAAAAGGTACAGGTTGGATGATAAGTCC

The first part between separators (^\([^@>].*\)) matches any line whose first character is not @ or >, and captures the whole line in a group (the escaped parentheses). The first character has to sit inside the group: if it is matched outside the parentheses, it is consumed by the match and the first base of every read is dropped from the output. The second part is the replacement: AAAAAA followed by group 1, i.e. the line that was captured by the parentheses.

Update: Added > to the negated character class so it also works for FASTA files. See also the comment below about FASTQ and multi-line FASTA files.

manuel.belmadani : You should update your solution to reflect the change OP made to the original question when you have a chance.

I added the > in the character class. Just be careful that your FASTA files don't have reads wrapped over multiple lines, or it'll break: it adds AAAAAA at the beginning of every non-header line, even when several lines are part of the same contiguous read. That use case is a bit more complicated than the input provided in the original question. Same thing if you have a complete FASTQ file (i.e. with the quality scores): then you'd have to avoid editing the quality header (the + line) and the quality line. Something like what Pierre suggested would work to edit only the second line of every four-line record: sed '2~4 s/^/AAAAAA/' fastq.fq
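
For the multi-line FASTA case described above, one possible approach is to prepend the adapter only to the first sequence line after each header. This is a rough sketch rather than a tested tool; add_adapter_fasta is a hypothetical helper name, and it assumes every header is immediately followed by sequence (no blank lines in between).

```shell
# Prepend an adapter only to the first sequence line of each FASTA record,
# so wrapped (multi-line) sequences receive the adapter exactly once.
# add_adapter_fasta is a hypothetical helper name for this sketch.
add_adapter_fasta() {
    awk -v adapter="$1" '
        /^>/  { print; first = 1; next }        # header: flag the next line
        first { $0 = adapter $0; first = 0 }    # first sequence line only
        { print }
    '
}

# Example: the first record is wrapped over two lines.
printf '>seq1\nGTCT\nCAGA\n>seq2\nCCGG\n' | add_adapter_fasta AAAAAA
```

Note that the first line of each record ends up longer than the remaining wrapped lines. Most parsers tolerate this, but re-wrap the output if a tool expects a fixed line width.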

5.1 years ago
sed '2~2 s/^/AAAAAA/' input.txt
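
The first~step address form (2~2 here) is a GNU sed extension, so it may not be available in other sed implementations such as the BSD sed shipped with macOS. Assuming a two-line-per-record FASTA file, a roughly equivalent awk sketch would be the following; add_adapter_even_lines is a hypothetical helper name.

```shell
# Prepend an adapter to every even-numbered line, i.e. the sequence line of
# each record in a two-line-per-record FASTA file.
# add_adapter_even_lines is a hypothetical helper name for this sketch.
add_adapter_even_lines() {
    awk -v adapter="$1" 'NR % 2 == 0 { $0 = adapter $0 } { print }'
}

# Example: two records read from standard input.
printf '>header\nGTCT\n>header\nCCGG\n' | add_adapter_even_lines AAAAAA
```

Like the sed one-liner, this selects lines purely by position, so it only gives the intended result when every record is exactly two lines long.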