different read names in paired-end data
3.1 years ago
debitboro ▴ 260

Dear all,

I have paired-end reads generated by an ABI SOLiD 4 system, in two FASTQ files: R1.fastq and R2.fastq. Looking at the contents of the two files, I found that the read names (headers) do not match between them, as shown below. This causes problems for the analysis, for example when trimming the reads with cutadapt.

R1.fastq:

@SRR3159522.1 2_33_78 length=50
GGGATCAAAGGTGCCTAAGAAAGTTCTCACTAAGGGNATCTTCTACGCC
+SRR3159522.1 2_33_78 length=50
CCCDFFFFHHHHHJJJGJJJJJJJJIIGIIIIIJJJ#1?CGHDHHGIJI
@SRR3159522.2 2_36_51 length=50
CTGGTGCGAAAAGGTGAAATAAAAAAGAAGAACGAAGAAGCCGGTGCCA
+SRR3159522.2 2_36_51 length=50
BBCFDFFFHHHHHJGHHIJIJJJJJJIGIIJJIIIJJIGGIJJJHIHHH
@SRR3159522.3 2_36_551 length=50
CCACACCGGGTAAGCTGGTTTGGCGATGCGGGATGATCCGAACGTGGAG
...
...

R2.fastq:

@SRR3159522.27470956 2_33_78 length=35
TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA
+SRR3159522.27470956 2_33_78 length=35
BCBFF############23AGHHHIJJIHIJJJJ
@SRR3159522.27470957 2_36_51 length=35
GTATGCTCCGTNANAGTCTACCAGCACTGACCAG
+SRR3159522.27470957 2_36_51 length=35
BB@FFFFFHHH#2#3AEHIJJIIJJIJJJJJIJJ
@SRR3159522.27470958 2_36_551 length=35
GTCCTGNTNNNNNNNTGAACCAACACCTTTTGTG
...
...

As you can see, the read headers differ between the two files: the read number after the accession prefix does not match, although the second field (for example 2_33_78) does.

When I used cutadapt to trim the reads, I got a name-matching error. I tried to replace the headers in R2.fastq with those from R1.fastq so that both files use the same read names, but I don't know how to do it. I want to transform R2.fastq as follows:

@SRR3159522.1 2_33_78 length=35
TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA
+SRR3159522.1 2_33_78 length=35
BCBFF############23AGHHHIJJIHIJJJJ
@SRR3159522.2 2_36_51 length=35
GTATGCTCCGTNANAGTCTACCAGCACTGACCAG
+SRR3159522.2 2_36_51 length=35
BB@FFFFFHHH#2#3AEHIJJIIJJIJJJJJIJJ
@SRR3159522.3 2_36_551 length=35
GTCCTGNTNNNNNNNTGAACCAACACCTTTTGTG
...
...

Can someone help me?

3.1 years ago

seqkit solution

seqkit replace -p '(^SRR[0-9]+\.)[0-9]+' -r '${1}{nr}' R2.fastq

@SRR3159522.1 2_33_78 length=35
TGTTTNNNNNNNNNNNNAAATGCCAGATCCACAA
+
BCBFF############23AGHHHIJJIHIJJJJ
@SRR3159522.2 2_36_51 length=35
GTATGCTCCGTNANAGTCTACCAGCACTGACCAG
+
BB@FFFFFHHH#2#3AEHIJJIIJJIJJJJJIJJ

I don't think the text after the + in a FASTQ record is strictly necessary, so you can probably remove it from the R1 file too. Someone please correct me if I'm wrong.
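
For anyone without seqkit installed, the same renumbering can be sketched in plain Python. This is only an illustration under some assumptions: every record is exactly four lines, headers look like `@SRR<digits>.<digits> ...`, and the reads in R2.fastq are in the same order as in R1.fastq. The function names `renumber_fastq` and `renumber_file` are made up for this example and are not part of any library.

```python
import re
from typing import Iterable, Iterator

# Accession prefix (e.g. "@SRR3159522.") followed by the old read number.
HEADER_RE = re.compile(r"^(@SRR[0-9]+\.)[0-9]+")


def renumber_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTQ lines with reads renumbered 1, 2, 3, ... in file order.

    Assumes strict 4-line records (header, sequence, plus line, quality).
    Headers that do not match HEADER_RE are passed through unchanged.
    """
    record = 0
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        position = i % 4
        if position == 0:
            record += 1
            # A function replacement avoids any ambiguity between the
            # group reference and the digits of the new read number.
            yield HEADER_RE.sub(lambda m: f"{m.group(1)}{record}", line, count=1)
        elif position == 2:
            # Write a bare "+" so the plus line cannot disagree with the header.
            yield "+"
        else:
            yield line


def renumber_file(src_path: str, dst_path: str) -> None:
    """Read src_path and write the renumbered records to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in renumber_fastq(src):
            dst.write(line + "\n")
```

Calling `renumber_file("R2.fastq", "R2.renamed.fastq")` would write a new file and leave the original untouched. The output file name is just a placeholder.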


The plus line has to either match the name exactly or be empty.
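
A minimal sketch of that rule as a check, in case it helps when deciding whether a file needs cleaning. The helper name `plus_line_ok` is hypothetical, and the function only looks at one header/plus pair at a time:

```python
def plus_line_ok(header: str, plus: str) -> bool:
    """Return True if the '+' line is bare or repeats the header text exactly."""
    header = header.rstrip("\n")
    plus = plus.rstrip("\n")
    if not header.startswith("@") or not plus.startswith("+"):
        return False
    # Either nothing follows the "+", or it is identical to what follows the "@".
    return plus == "+" or plus[1:] == header[1:]
```

By this check, renaming only the `@` line while leaving the old `+SRR3159522.27470956 ...` line in place would give a mismatched record, which is presumably why writing a bare `+` is the simpler option.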


Thank you rpolicastro, it works fine
