Biostar Beta. Not for public use.
Question: How To Check If Illumina Fastq Is Single Or Paired End With Minimal Sequence Id
7

Hi all, I am trying to check whether a FASTQ file is single-end or paired-end. From Wikipedia, I saw that the default format looks like this:

@HWUSI-EAS100R:6:73:941:1973#0/1

but in my case the sequence ID looks like this:

@HWUSI-EAS100R:6:73:941:1973

with the #0/1 part missing. Can I assume that it is single-end? I could not find a good source to learn about this. Can you also point me to one? Thanks

Posted 6.0 years ago by fbrundu • 280 • updated 6.0 years ago by Chris Fields ♦ 2.1k
2

Also check for duplicated read ids, to verify that the two files were not interleaved. This is as simple as grepping for a few read ids:

grep HWUSI-EAS100R:6:73:941:1973 readfile.fq

or just cut out the read ids and count them this way:

grep '^@HWUSI-EAS100R' readfile.fq | head -100000  | sort | uniq -c | sort -rgk 1,1 | head
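
A plain grep can also hit quality lines that happen to start with @, and it treats an old-style /1 or /2 suffix as part of the ID. A minimal sketch that looks only at header lines, assuming a 4-line-per-record FASTQ with no wrapped sequence lines (the function name dup_read_ids is made up for illustration):

```shell
# Hypothetical helper: print "count id" for every read ID seen more than once.
# Assumption: each record is exactly 4 lines, so headers are lines 1, 5, 9, ...
dup_read_ids() {
    awk 'NR % 4 == 1 {
             id = $1                  # keep only the part before the first space
             sub(/\/[12]$/, "", id)   # drop an old-style /1 or /2 suffix, if any
             seen[id]++
         }
         END {
             for (id in seen)
                 if (seen[id] > 1) print seen[id], id
         }' "$@"
}
```

With no file argument it reads standard input, so a sample of the file can be checked with something like `head -n 400000 readfile.fq | dup_read_ids`. No output means no duplicated IDs in the part that was examined.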
Posted 6.0 years ago by Istvan Albert • 80k
0

I am using the second one and will let you know later. (Why the sort flag -k? And why the first sort after head?) Moreover, if only one file is available, can I assume that it is single-end? Thanks

Posted 6.0 years ago by fbrundu • 280
1

There was a missing |, now edited. The flag is perhaps not needed in this case; it is just a habit to restrict the sort to the field that I am interested in rather than all fields.

Having one file suggests single-end data. But if you have identical read IDs, it might be interleaved paired-end data.

Posted 6.0 years ago by Istvan Albert • 80k
0

Ah OK, I misread your first comment. Perfect, so files can be interleaved. So I have to check whether the part of the sequence ID before the # has duplicates, because paired-end reads share the same sequence ID. Am I correct?

Posted 6.0 years ago by fbrundu • 280
2

Yes, as Istvan suggests, just grep for one ID: you should get one result if it's single-end. If you get two results, it's an interleaved paired-end file.

Posted 6.0 years ago by Nicolas Rosewick • 7.7k
0

Ah OK, I got it. So if the data are paired-end, every read has a mate with the same sequence ID (conversely, no sequence ID occurs only once). Please add a new answer or update the existing one with all this info if you can, so I can accept it for future reference.

Posted 6.0 years ago by fbrundu • 280
1

Yes, with paired-end data the two reads of a pair share the same ID.

Posted 6.0 years ago by Nicolas Rosewick • 7.7k
4

If it's paired-end, you should have two files (R1.fastq and R2.fastq). The read ID should carry information about the pair. Here is an example of one pair of reads; the information is at the end of the header line, in 1:N:0:28. The leading 1 marks a read from R1.fastq. FYI, it's from a HiSeq 2000 run.

R1.fastq :

@M00991:61:000000000-A7EML:1:1101:14011:1001 1:N:0:28
NGCTCCTAGGTCGGCATGATGGGGGAAGGAGAGCATGGGAAGAAATGAGAGAGTAGCAA
+
#8BCCGGGGGFEFECFGGGGGGGGG@;FFGGGEG@FF<EE<@FFC,CEGCCGGFF<FGF

R2.fastq :

@M00991:61:000000000-A7EML:1:1101:14011:1001 2:N:0:28
TTGCTACTCTCTCATTTCTTCCCATGCCTTCCTTCCCCCATCATGCCGACCTAGGAGCC
+
CCCCC,;FF,EA9CEE<6CFAFGGGD@,,6CC<FA@FG:FF8@F9EE7@FGCFGFFFFG

So in your example, it should be single-end data.
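
To look at this quickly on your own file, here is a rough sketch that inspects only the first header line and reports which mate marker it carries, if any. It assumes the two layouts shown in this thread (a trailing /1 or /2, or a second space-separated field starting with 1: or 2:); the function name mate_marker is made up for illustration.

```shell
# Hypothetical helper: report the mate marker found in the first header line.
# Assumption: the first line of the file is a FASTQ header.
mate_marker() {
    head -n 1 "$1" | awk '{
        if ($1 ~ /\/[12]$/)
            print "old-style suffix /" substr($1, length($1))
        else if ($2 ~ /^[12]:/)
            print "Casava 1.8+ field, read " substr($2, 1, 1)
        else
            print "no mate marker"
    }'
}
```

Running `mate_marker readfile.fq` on a header like the one in the question would print "no mate marker". That result alone does not settle the question, which is why checking for duplicated IDs is still worthwhile.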

Posted 6.0 years ago by Nicolas Rosewick • 7.7k
0

Two things. First, the sequence ID I posted is only an example; my real sequence ID is different but formatted in the same way. Second, the sequence ID you gave as an example is clear to me, but it seems to be in the new format, not the old one (see Wikipedia under "Illumina sequence identifiers"). Moreover, I have only one FASTQ file (and only one is available): can I assume that it is single-end because there is only one file?

Posted 6.0 years ago by fbrundu • 280
1

As @NicoBxl indicated, there are normally two files, one for each read of the pair, so the best bet is that this is single-end. However, you'll note that the ID in that example (the part before the space, not including the pairing info) matches in both cases. This ID contains the lane, tile, and x/y coordinates of the cluster the read belongs to; both reads of a pair originate from the same cluster and therefore have the same coordinates.

So, if you really want to be sure, you could parse through the data to make sure all the IDs are unique; if you have two sequences per ID, this may mean your data are paired but interleaved.
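
One way to do that parse, as a rough sketch rather than a definitive tool: tally how often each ID occurs and summarise the result. It assumes 4-line records and keeps every ID in memory, so for a large file it is probably better to feed it a sample. The function name classify_fastq and the wording of the messages are made up for illustration.

```shell
# Hypothetical helper: classify a FASTQ by how often each read ID occurs.
# Assumptions: 4 lines per record; pairing info is either a /1 or /2 suffix
# or a separate field after the first space.
classify_fastq() {
    awk 'NR % 4 == 1 {
             id = $1                  # part before the first space
             sub(/\/[12]$/, "", id)   # strip old-style mate suffix
             n[id]++
         }
         END {
             for (id in n) {
                 if (n[id] == 1)      single++
                 else if (n[id] == 2) paired++
                 else                 other++
             }
             if (other)                  print "unexpected: some IDs occur more than twice"
             else if (paired && !single) print "interleaved paired-end (every ID occurs twice)"
             else if (single && !paired) print "single-end or one mate file (all IDs unique)"
             else if (paired && single)  print "mixed: some IDs paired, some unique"
             else                        print "no records"
         }' "$@"
}
```

With no file argument it reads standard input, e.g. `head -n 800000 readfile.fq | classify_fastq`. Using a multiple of 8 lines avoids cutting an interleaved pair in half at the end of the sample.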

Posted 6.0 years ago by Chris Fields ♦ 2.1k
