1000 genomes phase 3 data representation
6.5 years ago
puneet.as • 0

Hello, can I get a brief overview of the data stored in the 1000 Genomes phase 3 dataset for a single sample, such as HG00113? There is a multitude of paired-end FASTQ files for it on their FTP server:

ERR020088_1.filt.fastq.gz 24.6 GB
ERR020088_2.filt.fastq.gz 24.7 GB
ERR229776.filt.fastq.gz 360 MB
ERR229776_1.filt.fastq.gz 9.4 GB
ERR229776_2.filt.fastq.gz 9.7 GB
SRR070517.filt.fastq.gz 7.4 MB
SRR070517_1.filt.fastq.gz 2.2 GB
SRR070517_2.filt.fastq.gz 2.3 GB
SRR070802.filt.fastq.gz 6.8 MB
SRR070802_1.filt.fastq.gz 2.2 GB
SRR070802_2.filt.fastq.gz 2.3 GB

Can someone explain how to interpret this data? Is it the same sample, or are different samples included under the same run accession?

Why do I get multiple sets of paired-end reads?

1000 genomes phase 3 sequence data • 1.4k views
6.5 years ago

Hey,

Are you sure that those files relate to HG00113? - they appear to relate to HG00101: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/data/HG00101/sequence_read/


Firstly, the 'filt' suffix just indicates that the 1000 Genomes consortium has done some QC filtering of the reads, which may or may not be welcome:

These are the checks the DCC makes on the archive fastq files.

    Syntax Checks:

    -Each header line begins with @
    -The third line always starts with a +
    -There are four lines in each entry (implied by the above two rules)
    -On line3, if a name follows the + sign, the name has to match the one found in line1
    -The sequence and quality lines are the same length
    -For paired end files, the _1 and _2 files have the same number of reads in them. 
    -For SOLID colourspace fastq, each read starts with a base followed by a string of numbers

    Sequence Checks:

    -Read is longer than 35bp for Solexa, 25bp for Solid, and 30 bp for 454
    -Read does not contain any N's in the first 25, 30 or 35bp
    -Quality values are all 2 or higher in the first 25bp, 30bp or 35bp
    -The reads contain more than one type of base in the first 25, 30, or 35bp
    -Read does not contain more than 50% Ns in its whole length
    -Read does not contain characters other than ATGCN (this rule does not apply to SOLID reads)
  

The output files get the extension .filt.fastq.gz to indicate they have been filtered.

[source: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/historical_data/former_toplevel/README.sequence_data]
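To make the syntax checks above a bit more concrete, here is a rough Python sketch of how a few of them could be applied to a single four-line FASTQ entry. This is not the DCC's actual code; the function name `fastq_syntax_errors` and the error messages are made up for illustration, and the last check (ACGTN only) comes from the sequence checks and would not apply to SOLiD colourspace reads.

```python
from typing import List


def fastq_syntax_errors(record: List[str]) -> List[str]:
    """Return the problems found in one FASTQ entry (hypothetical helper).

    `record` is the list of lines that make up a single entry.
    """
    if len(record) != 4:
        return ["entry does not have four lines"]

    header, seq, plus, qual = (line.rstrip("\n") for line in record)
    errors = []

    # Each header line begins with @
    if not header.startswith("@"):
        errors.append("header line does not begin with @")

    # The third line starts with +; a name after + must match line 1
    if not plus.startswith("+"):
        errors.append("third line does not start with +")
    elif len(plus) > 1 and plus[1:] != header[1:]:
        errors.append("name after + does not match header")

    # Sequence and quality lines are the same length
    if len(seq) != len(qual):
        errors.append("sequence and quality lengths differ")

    # No characters other than ATGCN (not valid for SOLiD colourspace)
    if set(seq.upper()) - set("ACGTN"):
        errors.append("sequence contains characters other than ACGTN")

    return errors
```

An entry that passes returns an empty list, so something like `fastq_syntax_errors(["@r1", "ACGT", "+", "IIII"])` should come back empty.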


In terms of the files themselves, it's DNA from the same individual that's being sequenced, but by different centers, on different sequencers, and with different protocols (some runs are even exome-seq), which is why you see several run accessions. Information on each sample can be pulled from the following 64 megabyte file: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase3/20130502.phase3.sequence.index

It may be useful to download this and then use grep to extract the information for your files of interest!
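If you'd rather not eyeball the grep output, a small Python sketch along these lines could group the index rows for one sample by run accession. It assumes the index is tab-delimited with a header line starting with `#FASTQ_FILE`, and that it has columns named `SAMPLE_NAME` and `RUN_ID` (plus things like `CENTER_NAME`, `INSTRUMENT_PLATFORM` and `LIBRARY_LAYOUT`); please check those names against the header of the file you actually download. The function name `runs_for_sample` is just for illustration.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List


def runs_for_sample(lines: Iterable[str], sample: str) -> Dict[str, List[dict]]:
    """Group sequence.index rows for `sample` by run accession.

    Assumes a tab-delimited file whose header line starts with
    '#FASTQ_FILE' and includes SAMPLE_NAME and RUN_ID columns.
    """
    header = None
    runs = defaultdict(list)
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue
        if header is None:
            # Skip any leading comment lines until the column header
            if row[0].lstrip("#") == "FASTQ_FILE":
                header = [name.lstrip("#") for name in row]
            continue
        record = dict(zip(header, row))
        if record.get("SAMPLE_NAME") == sample:
            runs[record["RUN_ID"]].append(record)
    return dict(runs)


# Example usage, assuming the index was downloaded to the working directory:
#
# with open("20130502.phase3.sequence.index") as handle:
#     for run_id, rows in runs_for_sample(handle, "HG00101").items():
#         first = rows[0]
#         print(run_id, first.get("CENTER_NAME"),
#               first.get("INSTRUMENT_PLATFORM"), first.get("LIBRARY_LAYOUT"))
```

Each key in the result is one run accession (ERR.../SRR...), so you can see at a glance which center and platform produced each set of files.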

Good luck,

Kevin


Thanks a ton, Kevin, this cleared up the confusion that I had. Thanks for the sources, I will go through them!


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.


Thanks Wouter! You the man!


I feel like a bot sometimes though. Anyway, it keeps me busy while waiting for scripts to finish, Travis to check my build, and so on.


Okay great - good luck with it! It's a lot of data
