Software to extract run information from FASTQ headers?
5.2 years ago
James Ashmore ★ 3.4k

Given a directory containing multiple FASTQ files, I would like to retrieve the run information from each file (e.g. flowcell, lane number, index, etc.) and export the data to a CSV file. Before I write a script to do this myself, is anyone aware of any software that does this already?

fastq • 4.3k views

Writing a one-liner that does this should take less time than it did to write your post. This is homework, isn't it?

Edit: considering your rep, maybe it isn't.


This isn't homework. There seems to be a surprising number of edge cases depending on which version of CASAVA was used to convert from BCL to FASTQ format.


How is the required information encoded in the FASTQ headers, and is it encoded at all? I am not sure whether there is a standard format for FASTQ headers. The information might be available as metadata from your sequencing provider, or it might be encoded in the file name. If you could provide an example of your filenames and headers, someone might be able to help you with a quick sed|grep|awk script.


While there isn't really a standard, read names from Illumina machines have a more or less common format, as long as the provider hasn't changed them. I guess it would be useful to extract things like the machine model and ID and the flow cell ID, and match those up to databases of what the strings mean (i.e. identify file one as coming from a HiSeq 2500 and file two as coming from a NovaSeq, etc.).

Sounds like it might be quite useful, but I've not seen a tool that does it before.
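
For illustration, here is a minimal Python sketch of parsing the two read-name layouts that Illumina pipelines have commonly written: the CASAVA 1.8+ form (`@instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`) and the older form (`@instrument:lane:tile:x:y#index/read`). The function name `parse_header` and the regular expressions are illustrative assumptions, not taken from any existing tool, and read names rewritten by a provider may match neither pattern.

```python
import re

# CASAVA >= 1.8 layout:
#   @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
NEW_STYLE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):(?P<lane>\d+)"
    r":(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:\s+(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+)(?::(?P<index>\S+))?)?"
)

# Older layout:
#   @instrument:lane:tile:x:y#index/read
OLD_STYLE = re.compile(
    r"^@(?P<instrument>[^:\s]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/\s]+))?(?:/(?P<read>\d))?"
)


def parse_header(line):
    """Return a dict of run fields, or None if neither layout matches."""
    text = line.strip()
    # Try the newer layout first; an old-style name cannot satisfy it
    # because it has too few colon-separated fields.
    for pattern in (NEW_STYLE, OLD_STYLE):
        match = pattern.match(text)
        if match:
            # Drop optional groups that were absent from this header.
            return {k: v for k, v in match.groupdict().items() if v is not None}
    return None
```

Returning `None` for unrecognised names, rather than guessing, makes the edge cases visible so they can be handled one by one.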


In my case I want to create read group information as explained by GATK by automating the extraction of run information and creating the read groups for each run/sample.
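
As a rough sketch of that step, the Python function below assembles a SAM `@RG` header line from fields pulled out of a read name. It assumes one common convention (ID as flowcell.lane, PU as flowcell.lane.index); check this against the GATK read group documentation before relying on it. The function name `read_group_line` is hypothetical, and the sample and library names are not in the FASTQ header, so they have to be supplied from elsewhere (e.g. a sample sheet or the file name).

```python
def read_group_line(flowcell, lane, sample, library, index=None, platform="ILLUMINA"):
    """Build a tab-separated SAM @RG header line from run fields.

    Assumed convention (not the only valid one):
      ID = flowcell.lane
      PU = flowcell.lane.index (or flowcell.lane when no index is known)
    """
    rg_id = f"{flowcell}.{lane}"
    unit = f"{rg_id}.{index}" if index else rg_id
    fields = [
        "@RG",
        f"ID:{rg_id}",
        f"PU:{unit}",
        f"SM:{sample}",
        f"LB:{library}",
        f"PL:{platform}",
    ]
    return "\t".join(fields)
```

For example, `read_group_line("FC706VJ", "2", "SAMPLE1", "lib1", index="ATCACG")` gives a line whose ID is `FC706VJ.2` and whose PU is `FC706VJ.2.ATCACG`.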


I think Illumina has a standardized format.

So basically it's something like this (maybe not exactly; I don't know whether one file can contain more than one lane, tile, index, or whatever). Add more fields after $3 as needed:

find /some/place -maxdepth 1 -name "*.fq" | xargs -I {} awk -v n="{}" 'BEGIN{FS=":";OFS=","} NR==1{print n,$2,$3; exit}' {}
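
The same idea can be sketched in Python, which also handles gzip-compressed files. Like the one-liner, this looks only at the first header of each file and splits it on `:`. The column names assume the CASAVA 1.8+ field order (run, flowcell, lane in positions 2 to 4), which is an assumption about the data rather than something the code verifies. The function names are illustrative.

```python
import csv
import gzip
import sys
from pathlib import Path


def first_header(path):
    """Return the first line of a FASTQ file (gzip-compressed or plain)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return handle.readline().rstrip("\n")


def header_row(path):
    """Split the first header on ':' and keep fields 2-4, like awk's $2,$3,$4."""
    fields = first_header(path).split(":")
    # Pad so short or non-Illumina headers still yield a full-width row.
    fields += [""] * (4 - len(fields))
    return [str(path)] + fields[1:4]


def write_csv(directory, out=sys.stdout):
    """Write one CSV row per *.fq / *.fq.gz file directly inside `directory`."""
    writer = csv.writer(out)
    # Column names assume the CASAVA 1.8+ field order.
    writer.writerow(["file", "run", "flowcell", "lane"])
    for path in sorted(Path(directory).glob("*.fq*")):
        writer.writerow(header_row(path))
```

Calling `write_csv("/some/place")` prints the table to standard output. As noted above, a file that mixes lanes or indexes would need every header read, not just the first.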

5.2 years ago

I wrote illuminadir for my lab: http://lindenb.github.io/jvarkit/IlluminaDirectory.html. The output is a JSON or XML file:

$ find . -type f -name "*.fastq.gz" |  java  -jar dist/illuminadir.jar -j 
{
    "samples": [
        {
            "files": [
                {
                    "forward": {
                        "file-size": 82852,
                        "md5filename": "3854678b2381caa92758a980eb0d180d",
                        "path": "./src/test/resources/SAMPLE1_GATGAATC_L002_R1_001.fastq.gz",
                        "side": 1
                    },
                    "id": "p1",
                    "index": "GATGAATC",
                    "lane": "2",
                    "md5pair": "6fb0b6d5dec57436724e8efcb736aed5",
                    "reverse": {
                        "file-size": 82625,
                        "md5filename": "5450dea531fb2d807402d6bcc59cf15d",
                        "path": "./src/test/resources/SAMPLE1_GATGAATC_L002_R2_001.fastq.gz",
                        "side": 2
                    },
                    "split": "1"
                }
            ],
            "sample": "SAMPLE1"
        }
    ],
    "undetermined": []
}

Thanks Pierre - should be able to convert from JSON to CSV on the fly.
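
A minimal Python sketch of that conversion, assuming the JSON has exactly the shape shown in the answer above (`samples[].files[]` with `forward`/`reverse` paths plus `index`, `lane` and `split`). The function name `illuminadir_json_to_csv` and the choice of columns are illustrative; other versions of the tool may emit different keys, so missing keys are written as empty cells.

```python
import csv
import json
import sys


def illuminadir_json_to_csv(json_text, out=sys.stdout):
    """Write one CSV row per FASTQ pair found under samples[].files[]."""
    data = json.loads(json_text)
    writer = csv.writer(out)
    writer.writerow(["sample", "index", "lane", "split", "forward", "reverse"])
    for sample in data.get("samples", []):
        for pair in sample.get("files", []):
            writer.writerow([
                sample.get("sample", ""),
                pair.get("index", ""),
                pair.get("lane", ""),
                pair.get("split", ""),
                # Single-end data may lack a reverse entry; fall back to "".
                pair.get("forward", {}).get("path", ""),
                pair.get("reverse", {}).get("path", ""),
            ])
```

Piping the tool's output into a script that calls `illuminadir_json_to_csv(sys.stdin.read())` would do the conversion on the fly.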
