Header conflicts in concatenated FASTQ file?


3.0 years ago

Dunois ★ 2.5k

Say I have two FASTQ files that contain two different sequences under the same header, like so:

File 1:

@ABC1234
ATGCATGC
+
<<<<<<<<

File 2:

@ABC1234
TTAGTTTT
+
<<<<<<<<

If I were to concatenate these, I'd end up with duplicated headers that are associated with different sequences.

Tools like the de novo transcriptome assembler Trinity seem to suggest pooling reads (i.e., concatenating different FASTQ files) for assembly (e.g., ahead of differential expression analysis). But wouldn't duplicated headers attached to different sequences be a problem in that case? Or do tools that accept FASTQ input re-index the sequences and discard the headers?

If this is an issue, what's the best way to deal with this?

FASTQ RNA-seq concatenate • 923 views


Are you monkeying with the FASTQ read names? Reads are usually named after the instrument ID, the run ID, and their coordinates on the flow cell, a combination that is always going to be unique.
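To illustrate, headers in the common Illumina style (Casava 1.8 and later) pack these fields into a colon-separated ID. The Python sketch below splits such an ID into its parts. The example header is made up, the names `IlluminaReadId` and `parse_illumina_header` are hypothetical, and files that have been re-processed (for instance, downloaded from an archive) may use a different layout.

```python
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    """Fields of a Casava 1.8+ style read ID (assumed layout)."""
    instrument: str
    run_number: str
    flowcell_id: str
    lane: str
    tile: str
    x: str
    y: str


def parse_illumina_header(header: str) -> IlluminaReadId:
    """Split '@instrument:run:flowcell:lane:tile:x:y [comment]' into fields."""
    if not header.startswith("@"):
        raise ValueError(f"not a FASTQ header line: {header!r}")
    # Keep only the read ID; anything after the first whitespace is a comment.
    read_id = header[1:].split()[0]
    fields = read_id.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    return IlluminaReadId(*fields)


# Made-up header, for illustration only.
example = "@M00123:45:000000000-ABCDE:1:1101:15589:1332 1:N:0:1"
parsed = parse_illumina_header(example)
print(parsed)
```

Because the tile and x/y coordinates identify a single cluster within a run, two reads from unmodified files should not end up with the same full ID.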

3.0 years ago by swbarnes2 14k

I'm not messing around with the sequence headers, no. I'm just asking to make sure I'm not making a colossal mistake by just cat-ing a couple of files together for assembly.

3.0 years ago by Dunois ★ 2.5k

R1 and R2 reads might share the same name (though they often have a _1 or _2 appended to them). Otherwise, read IDs are naturally going to be unique as they come off the Illumina instrument, provided you don't mess with them.
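If in doubt, one way to check before concatenating is to count read IDs across the files. This is a minimal Python sketch, assuming uncompressed four-line FASTQ records (no wrapped sequence lines). The function names `read_ids` and `duplicated_ids` are illustrative rather than from any library, and the two in-memory "files" reuse the toy records from the question.

```python
import io
from collections import Counter


def read_ids(handle):
    """Yield the read ID of each record in a four-line FASTQ stream."""
    # Ignore blank lines so a trailing newline does not shift the record frame.
    lines = (line.strip() for line in handle if line.strip())
    for i, line in enumerate(lines):
        if i % 4 == 0:  # first line of every record is the header
            if not line.startswith("@"):
                raise ValueError(f"expected a header line, got {line!r}")
            # The ID is the text after '@' up to the first whitespace.
            yield line[1:].split()[0]


def duplicated_ids(handles):
    """Return the sorted read IDs that occur more than once across all handles."""
    counts = Counter()
    for handle in handles:
        counts.update(read_ids(handle))
    return sorted(read_id for read_id, n in counts.items() if n > 1)


# Toy input mirroring File 1 and File 2 from the question.
file1 = io.StringIO("@ABC1234\nATGCATGC\n+\n<<<<<<<<\n")
file2 = io.StringIO("@ABC1234\nTTAGTTTT\n+\n<<<<<<<<\n")

dups = duplicated_ids([file1, file2])
print(dups)
```

For real data, the `io.StringIO` objects would be replaced with file handles from `open(path)`. Keeping every ID in memory is fine for a sanity check on modest files, but may be too heavy for very large runs.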

3.0 years ago by swbarnes2 14k

I wasn't thinking of concatenating the pairs themselves. That puts me in the clear then.

Thank you!!

3.0 years ago by Dunois ★ 2.5k