Question

Software to extract run information from FASTQ headers?

2

5.2 years ago

James Ashmore ★ 3.4k

Given a directory containing multiple FASTQ files, I would like to retrieve the run information from each file (e.g. flowcell, lane number, index, etc.) and export the data to a CSV file. Before I write a script to do this myself, is anyone aware of any software that does this already?

fastq • 4.3k views

ADD COMMENT • link updated 5.2 years ago by Pierre Lindenbaum 161k • written 5.2 years ago by James Ashmore ★ 3.4k

0

Writing a one-liner that does this should take less time than it did to write your post. This is homework, isn't it?

Edit: considering your rep, maybe it isn't.

ADD REPLY • link 5.2 years ago by 5heikki 11k

0

This isn't homework. There seem to be a surprising number of edge cases depending on which version of CASAVA was used to convert from BCL to FASTQ format.
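To illustrate the kind of edge case meant here: headers written by CASAVA 1.8 and later look like `@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`, while older pipelines wrote `@instrument:lane:tile:x:y#index/read`, which carries neither a run number nor a flowcell ID. A minimal Python sketch that tells the two apart might look like the following. The function name `parse_header` and the style labels are made up for illustration, and other variants (reads renamed by a provider, SRA-style names) are not handled.

```python
import re

# CASAVA 1.8+ style: @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
NEW_STYLE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):(?P<lane>\d+)"
    r":(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:\s+(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+)(?::(?P<index>\S+))?)?$"
)

# Older style: @instrument:lane:tile:x:y#index/read (no run number, no flowcell ID)
OLD_STYLE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<read>\d+))?$"
)


def parse_header(line):
    """Return a dict of header fields plus a 'style' label, or None if unrecognised."""
    line = line.strip()
    for style, pattern in (("casava>=1.8", NEW_STYLE), ("casava<1.8", OLD_STYLE)):
        match = pattern.match(line)
        if match:
            fields = match.groupdict()
            fields["style"] = style
            return fields
    return None


if __name__ == "__main__":
    print(parse_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"))
    print(parse_header("@HWUSI-EAS100R:6:73:941:1973#0/1"))
```

Returning `None` for anything unrecognised makes it easy to list the files that need a closer look instead of silently writing wrong values.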

ADD REPLY • link 5.2 years ago by James Ashmore ★ 3.4k

0

How is the required information encoded in the FASTQ headers, and is it encoded there at all? I am not sure there is a standard format for FASTQ headers. The information might be available as metadata from your sequencing provider, or it might be encoded in the file name. If you could provide an example of your filenames and headers, someone might be able to help you with a quick sed|grep|awk script.

ADD REPLY • link 5.2 years ago by Michael 54k

1

While there isn't really a standard, read names from Illumina machines have a more or less common format as long as the provider hasn't changed them. I guess it would be useful to extract things like the machine model and ID and the flow cell ID, and to match those up to databases of what the strings mean (i.e. identify file one as coming from a HiSeq 2500 and file two as coming from a NovaSeq, etc.).

Sounds like it might be quite useful, but I've not seen a tool that does it before.

ADD REPLY • link 5.2 years ago by i.sudbery 19k

1

In my case I want to create read group information as explained by GATK by automating the extraction of run information and creating the read groups for each run/sample.
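As a rough illustration of that last step, here is a minimal Python sketch that turns already-extracted run information into a SAM `@RG` header line. It follows the convention described in the GATK read group documentation (`ID` as flowcell.lane, `PU` as flowcell.lane.barcode, `PL` as `ILLUMINA`). The function name `read_group_line` is hypothetical, and falling back to the sample name for `LB` when no library name is known is an assumption, not a recommendation.

```python
def read_group_line(flowcell, lane, sample, index=None, library=None):
    """Build a tab-separated SAM @RG line from run information.

    Assumption: when no library name is given, the sample name is reused for LB.
    """
    rg_id = f"{flowcell}.{lane}"
    # The platform unit adds the sample barcode when one is available.
    unit = f"{rg_id}.{index}" if index else rg_id
    fields = [
        ("ID", rg_id),
        ("PU", unit),
        ("SM", sample),
        ("PL", "ILLUMINA"),
        ("LB", library or sample),
    ]
    return "\t".join(["@RG"] + [f"{tag}:{value}" for tag, value in fields])


if __name__ == "__main__":
    print(read_group_line("FC706VJ", 2, "SAMPLE1", index="ATCACG"))
```

Note that an `ID` of flowcell.lane is only unique when each lane holds a single sample; for multiplexed lanes the sample or barcode would have to be part of the `ID` as well.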

ADD REPLY • link 5.2 years ago by James Ashmore ★ 3.4k

0

I think Illumina has a standardized format.

So basically it's something like this (maybe not exactly; I don't know whether one file can contain more than one lane, tile, index, or whatever):

find /some/place -maxdepth 1 -name "*.fq" | xargs -I {} awk -v n="{}" 'BEGIN{FS=":";OFS=","} NR==1{print n,$2,$3,$4; exit}' {}
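The same idea as a small Python script, in case a CSV with named columns is wanted. This is only a sketch: it assumes CASAVA 1.8+ style headers (`@instrument:run:flowcell:lane:... read:filtered:control:index`), looks only at the first read of each file, and the names `first_header`, `run_info` and `write_run_table` are made up for illustration.

```python
import csv
import gzip
from pathlib import Path


def first_header(path):
    """Return the first line of a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.readline().rstrip("\n")


def run_info(header):
    """Split a CASAVA 1.8+ style header into run fields (assumed format)."""
    name, _, comment = header.lstrip("@").partition(" ")
    parts = name.split(":")
    if len(parts) < 4:
        raise ValueError(f"unexpected header: {header!r}")
    instrument, run, flowcell, lane = parts[:4]
    # The index is the fourth colon-separated field after the space, if present.
    index = comment.split(":")[3] if comment.count(":") >= 3 else ""
    return {"instrument": instrument, "run": run, "flowcell": flowcell,
            "lane": lane, "index": index}


def write_run_table(directory, out_csv,
                    patterns=("*.fq", "*.fastq", "*.fq.gz", "*.fastq.gz")):
    """Write one CSV row per FASTQ file in `directory`; return the file count."""
    columns = ["file", "instrument", "run", "flowcell", "lane", "index"]
    files = sorted(p for pattern in patterns for p in Path(directory).glob(pattern))
    with open(out_csv, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=columns)
        writer.writeheader()
        for path in files:
            row = run_info(first_header(path))
            row["file"] = path.name
            writer.writerow(row)
    return len(files)


if __name__ == "__main__":
    import sys
    write_run_table(sys.argv[1], sys.argv[2])
```

Reading only the first record is fast but would miss files that mix several lanes or flowcells, which is the caveat raised above.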

ADD REPLY • link 5.2 years ago by 5heikki 11k

score 3 · Accepted Answer · 2019-02-08

I wrote illuminadir for my lab: http://lindenb.github.io/jvarkit/IlluminaDirectory.html. The output is a JSON or XML file:

$ find . -type f -name "*.fastq.gz" | java -jar dist/illuminadir.jar -j
{
    "samples": [
        {
            "files": [
                {
                    "forward": {
                        "file-size": 82852,
                        "md5filename": "3854678b2381caa92758a980eb0d180d",
                        "path": "./src/test/resources/SAMPLE1_GATGAATC_L002_R1_001.fastq.gz",
                        "side": 1
                    },
                    "id": "p1",
                    "index": "GATGAATC",
                    "lane": "2",
                    "md5pair": "6fb0b6d5dec57436724e8efcb736aed5",
                    "reverse": {
                        "file-size": 82625,
                        "md5filename": "5450dea531fb2d807402d6bcc59cf15d",
                        "path": "./src/test/resources/SAMPLE1_GATGAATC_L002_R2_001.fastq.gz",
                        "side": 2
                    },
                    "split": "1"
                }
            ],
            "sample": "SAMPLE1"
        }
    ],
    "undetermined": []
}
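
For readers who only need the fields visible in the file names above, the naming pattern `SAMPLE_INDEX_L00N_R[12]_00N.fastq.gz` can be split with a regular expression. The Python sketch below is not how illuminadir is implemented; it just reproduces the `sample`, `index`, `lane`, `side` and `split` values shown in the JSON, and the function name `parse_fastq_name` is hypothetical. Names written by newer bcl2fastq versions, which put a sample number such as `S1` where the index sequence is here, are not handled.

```python
import re
from pathlib import Path

# SAMPLE_INDEX_L<lane>_R<side>_<split>.fastq.gz, e.g. SAMPLE1_GATGAATC_L002_R1_001.fastq.gz
NAME_PATTERN = re.compile(
    r"^(?P<sample>.+)_(?P<index>[ACGTN]+(?:-[ACGTN]+)?)"
    r"_L(?P<lane>\d+)_R(?P<side>[12])_(?P<split>\d+)\.fastq\.gz$"
)


def parse_fastq_name(path):
    """Return sample, index, lane, side and split parsed from a FASTQ file name."""
    match = NAME_PATTERN.match(Path(path).name)
    if match is None:
        return None
    return {
        "sample": match.group("sample"),
        "index": match.group("index"),
        "lane": str(int(match.group("lane"))),    # "002" becomes "2"
        "side": int(match.group("side")),         # 1 = forward, 2 = reverse
        "split": str(int(match.group("split"))),  # "001" becomes "1"
    }


if __name__ == "__main__":
    print(parse_fastq_name("./src/test/resources/SAMPLE1_GATGAATC_L002_R1_001.fastq.gz"))
```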