Say I have two FASTQ files in which two different sequences happen to share the same header, like so.
File 1:
@ABC1234
ATGCATGC
+
<<<<<<<<
File 2:
@ABC1234
TTAGTTTT
+
<<<<<<<<
If I were to concatenate these, I'd end up with duplicated headers that are associated with different sequences.
Tools like the de novo transcriptome assembler Trinity seem to suggest pooling reads (i.e., concatenating different FASTQ files) for assembly (e.g., for differential expression analysis). But isn't the duplicated-header, unique-sequence situation an issue here? Do tools that accept FASTQ inputs re-index the sequences and discard the headers?
If this is an issue, what's the best way to deal with this?
Are you monkeying with the FASTQ names? Usually reads are named after their instrument ID, run ID, and coordinates on the flow cell, so the names are always going to be unique.
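To make the naming point concrete, here is a minimal sketch that splits a read name into its parts. It assumes the colon-separated layout written by recent Illumina software (instrument, run number, flow cell ID, lane, tile, x, y, followed by a space and a second field describing the read). The function name `parse_illumina_read_name` and the header in the usage comment are made up for illustration; older pipelines and other platforms name reads differently.

```python
from typing import NamedTuple


class IlluminaReadName(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_read_name(header: str) -> IlluminaReadName:
    """Split a FASTQ header into the fields that make the read name unique.

    Assumes the 7-field, colon-separated layout described above; anything
    after the first whitespace (read number, filter flag, ...) is ignored.
    """
    text = header.strip()
    if text.startswith("@"):
        text = text[1:]
    name = text.split()[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"unexpected read name layout: {name!r}")
    instrument, run, flowcell, lane, tile, x, y = fields
    return IlluminaReadName(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y)
    )


# Hypothetical header, for illustration only:
# parse_illumina_read_name("@M00001:12:000000000-ABCDE:1:1101:15343:1973 1:N:0:1")
```

Because the tile and x/y coordinates identify a single cluster on a single flow cell, two reads from different runs or lanes should not end up with the same name unless the headers were rewritten.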
I'm not messing around with the sequence headers, no. I'm just asking to make sure I'm not making a colossal mistake by just cat-ing a couple of files together for assembly.
R1 and R2 reads might share the same name (though they often have a _1 or _2 appended to them). Otherwise, read IDs are naturally going to be unique if you don't mess with them as they come off the Illumina instrument.
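If you would rather check than assume, a rough sketch like the following counts read IDs across the files you plan to concatenate. It assumes uncompressed FASTQ with exactly four lines per record (no wrapped sequence lines), takes the ID to be the header text up to the first whitespace, and keeps every ID in memory, so it is only meant as a sanity check on modestly sized files. The function names are hypothetical.

```python
from collections import Counter
from itertools import islice


def read_ids(path):
    """Yield the read ID of each record in a 4-line-per-record FASTQ file."""
    with open(path) as handle:
        # Header lines are lines 0, 4, 8, ... under the 4-line assumption.
        for header in islice(handle, 0, None, 4):
            header = header.strip()
            if not header:
                continue  # tolerate a trailing blank line
            yield header[1:].split()[0]  # drop '@', keep text before whitespace


def duplicated_ids(paths):
    """Return {read_id: count} for IDs seen more than once across all files."""
    counts = Counter()
    for path in paths:
        counts.update(read_ids(path))
    return {read_id: n for read_id, n in counts.items() if n > 1}
```

An empty result for the files from a single end (say, all the R1 files) means concatenating them will not produce clashing headers. Running it on an R1 file together with its R2 mate is expected to report duplicates whenever the mates share a name.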
I wasn't thinking of concatenating the pairs themselves. That puts me in the clear then.
Thank you!!