Question

Information on Illumina platinum genomes - ENA accession - PRJEB3246

18 months ago

Dhana ▴ 110

Hi,

I am trying to do benchmarking studies and have downloaded the BioProject with accession number PRJEB3246 from ENA. The description states:

We are providing deep whole genome sequence data for the CEPH 1463 family in order to create a "platinum" standard comprehensive set of variant calls. These genomes include a trio (NA12877 NA12878 and NA12882) sequenced to greater than 200x depth of coverage, as well as a technical replicate (separate library and sequencing, but same DNA sample) of NA12882 also sequenced to greater than 200x. Additional information and analyses will be provided at www.platinumgenomes.org.

The Coriell Institute page states that the CEPH 1463 family has 17 members (samples). The ENA page, however, lists 32 samples, and the run accessions at the NCBI site do not give the corresponding sample names either (the sample name is given as platinum 1, 2 and so on). I would like to understand why there are 32 samples in ENA when there should be 17 (+1 technical replicate). I am especially interested in identifying samples NA12878, NA12877 and NA12882.

How can I find out which run accessions correspond to which individuals?
How can I obtain information about the adapter sequences used? (I ran FastQC on a few samples, but it did not report any overrepresented sequences or adapter sequences.)

ENA WGS illumina genome PRJEB3246 platinum • 1.4k views

Obtain information about the adapter sequences used. ( I did fastqc for few samples but it did not report any overrepresented sequences/adapter sequences)

Adapter sequence does not have to be present in the data. These libraries were likely of very good quality (they were made by Illumina) and likely contain little or no adapter. They were likely made using TruSeq. You can also use BBTools to try to identify the adapter sequence; see: How to figure out adapter sequence for Illumina reads?

ADD REPLY • link 18 months ago by GenoMax 142k

Thank you, I tried to find out the adapter information using Illumina documentation and was having a hard time. BBTools worked like a charm.

ADD REPLY • link 18 months ago by Dhana ▴ 110

If you open https://www.ebi.ac.uk/ena/browser/view/PRJEB3246?show=reads and follow the "Sample Accession" links, you can see which individual each run comes from. SAMEA1531956 is NA12877 and SAMEA1531955 is NA12878; those are the two individuals I see in PRJEB3246. You can probably parse this out of the XML file or query their API to automate such lookups.
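One way the API route might look is sketched below. It is not part of the original reply. It assumes the ENA Portal API `filereport` endpoint with `result=read_run` and the field names `run_accession` and `sample_accession`; the function names `ena_filereport_url` and `runs_per_sample` are made up for illustration. The counting step finds the sample column by its header name, so it does not depend on column order.

```shell
# Build the (assumed) ENA Portal API URL for a study's run table in TSV form.
ena_filereport_url() {
  printf 'https://www.ebi.ac.uk/ena/portal/api/filereport?accession=%s&result=read_run&fields=run_accession,sample_accession&format=tsv\n' "$1"
}

# Read a filereport TSV on stdin and print "sample_accession<TAB>number_of_runs".
# The sample_accession column is located from the header line.
runs_per_sample() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "sample_accession") c = i; next }
    c && $c != "" { n[$c]++ }
    END { for (s in n) print s "\t" n[s] }
  ' | LC_ALL=C sort
}

# Usage (needs network access, so it is left commented out here):
# curl -s "$(ena_filereport_url PRJEB3246)" | runs_per_sample
```

Each sample accession in the output can then be mapped to an individual through its sample record, as described above.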

ADD REPLY • link 18 months ago by ATpoint 82k

Thank you for the reply. From GenoMax's reply and your comments I understand that the files correspond to 2 individuals (NA12878 and NA12877). I am curious why the fastq files for a single individual have been split into 14 and 18 separate experiments. Is it because of their depth? And if we were to use them for analysis, how should they be used?

ADD REPLY • link 18 months ago by Dhana ▴ 110

Yes, that is common for whole-genome experiments, or more generally for larger experiments that require more depth than a single lane can provide. Nothing to worry about: you can just cat the individual files together, separately for R1 and R2.
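A minimal sketch of that concatenation is given below. It is not from the original reply. It assumes the per-run files follow the `<run>_1.fastq.gz` / `<run>_2.fastq.gz` naming pattern, and `merge_runs` is a hypothetical helper name. Gzip files can be concatenated directly because a sequence of gzip members is itself a valid gzip stream. The important point is that R1 and R2 are built in the same run order so that read pairs stay in sync.

```shell
# Concatenate per-run FASTQ files into one R1 and one R2 file for a sample.
# Usage: merge_runs SAMPLE RUN [RUN ...]
merge_runs() {
  sample=$1
  shift
  # Start from empty output files so that reruns do not append twice.
  : > "${sample}_R1.fastq.gz"
  : > "${sample}_R2.fastq.gz"
  for run in "$@"; do
    # Same run order for both mates keeps the pairing intact.
    cat "${run}_1.fastq.gz" >> "${sample}_R1.fastq.gz"
    cat "${run}_2.fastq.gz" >> "${sample}_R2.fastq.gz"
  done
}

# Example (run accessions would come from the run table of the sample):
# merge_runs NA12877 ERR174310 ...
```

Merging discards the per-run (lane) identity of the reads, so if a pipeline needs per-lane read groups it may be preferable to align each run separately and merge afterwards.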

ADD REPLY • link 18 months ago by ATpoint 82k

Thank you for the clarification.

ADD REPLY • link 18 months ago by Dhana ▴ 110

score 5 · Accepted Answer · 2022-10-19

Some of this information can be obtained from SRA using EntrezDirect:

$ esearch -db bioproject -query PRJEB3246 | elink -target biosample | esummary | xtract -pattern DocumentSummary -element Identifiers,Paragraph

BioSample: SAMEA1531955; SRA: ERS179577 We have generated paired-end sequence data covering the genome of an CEPH female individual to a sequence depth of more than 200-fold using the Illumina HiSeq 2000. This individual is a member of the population samples described in the PhaseI and PhaseII HapMap Projects and is from the CEPH/UTAH pedigree 1463 (abbreviation: CEPH). The DNA identifier for this individual is NA12878. We obtained the DNA sample NA12878 from The Coriell Institute for Medical Research. Starting with 1ug of DNA, and following random fragmentation, we generated a PCR-Free sequencing library with a median insert size of ~300 bp. 100 base sequence reads were generated from both ends of these templates using the Illumina HiSeq 2000. We carried out purity-filtering (PF) to remove mixed reads, where two or more different template molecules are close enough on the surface of the flow-cell to form a mixed or overlapping cluster. No other filtering of the data has been carried out prior to submission. We have also submitted equivalent sequence data for the father and one of the son's of family 1463 (NA12877 and NA12882).

BioSample: SAMEA1531956; SRA: ERS179576 We have generated paired-end sequence data covering the genome of an CEPH male individual to a sequence depth of more than 200-fold using the Illumina HiSeq 2000. This individual is a member of the population samples described in the PhaseI and PhaseII HapMap Projects and is from the CEPH/UTAH pedigree 1463 (abbreviation: CEPH). The DNA identifier for this individual is NA12877. We obtained the DNA sample NA12877 from The Coriell Institute for Medical Research. Starting with 1ug of DNA, and following random fragmentation, we generated a PCR-Free sequencing library with a median insert size of ~300 bp. 100 base sequence reads were generated from both ends of these templates using the Illumina HiSeq 2000. We carried out purity-filtering (PF) to remove mixed reads, where two or more different template molecules are close enough on the surface of the flow-cell to form a mixed or overlapping cluster. No other filtering of the data has been carried out prior to submission. We have also submitted equivalent sequence data for the mother and one of the son's of family 1463 (NA12878 and NA12882).

You can get additional run-level information for each sample. Only part of the output is shown below to save space.

$ esearch -db sra -query ERS179576 | efetch -format runinfo

Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR174310,2013-01-03 11:29:28,2013-01-03 11:26:41,207579467,41931052334,207579467,202,27214,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/ERR000/174/ERR174310/ERR174310.sralite.1,ERX150456,CT8624,WGS,RANDOM,GENOMIC,PAIRED,300,0,ILLUMINA,Illumina HiSeq 2000,ERP001775,PRJEB3246,,204921,ERS179576,SAMEA1531956,simple,9606,Homo sapiens,SAMEA1531956,,,,,,,no,,,,,ILLUMINA,ERA166477,,public,EF49CD9FE603F1797081A925E13116D7,1A46A45B2B17FCEC3F94B9222BF1E754

For the second sample:

$ esearch -db sra -query ERS179577 | efetch -format runinfo
Run,ReleaseDate,LoadDate,spots,bases,spots_with_mates,avgLength,size_MB,AssemblyName,download_path,Experiment,LibraryName,LibraryStrategy,LibrarySelection,LibrarySource,LibraryLayout,InsertSize,InsertDev,Platform,Model,SRAStudy,BioProject,Study_Pubmed_id,ProjectID,Sample,BioSample,SampleType,TaxID,ScientificName,SampleName,g1k_pop_code,source,g1k_analysis_group,Subject_ID,Sex,Disease,Tumor,Affection_Status,Analyte_Type,Histological_Type,Body_Site,CenterName,Submission,dbgap_study_accession,Consent,RunHash,ReadHash
ERR174324,2013-01-03 13:14:21,2013-01-03 13:13:12,223571196,45161381592,223571196,202,28183,,https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos5/sra-pub-zq-11/ERR000/174/ERR174324/ERR174324.sralite.1,ERX150470,CT8595,WGS,RANDOM,GENOMIC,PAIRED,300,0,ILLUMINA,Illumina HiSeq 2000,ERP001775,PRJEB3246,,204921,ERS179577,SAMEA1531955,simple,9606,Homo sapiens,SAMEA1531955,,,,,,,no,,,,,ILLUMINA,ERA166477,,public,9F957F28050A45993205BA2069486360,007F48E27E5E8611582D3B9F888EC498
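The runinfo table is wide, so it can help to keep only a few columns. The sketch below is an illustration rather than part of the original answer. It selects `Run`, `BioSample` and `bases` by header name with awk, and `runinfo_summary` is a hypothetical helper name. It assumes no field contains a quoted comma, which holds for the rows shown above but is not guaranteed for every runinfo record.

```shell
# Read a runinfo CSV on stdin and print "Run<TAB>BioSample<TAB>bases".
# Columns are looked up by header name, so their position does not matter.
runinfo_summary() {
  awk -F',' '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
    NF > 1 { print $(col["Run"]) "\t" $(col["BioSample"]) "\t" $(col["bases"]) }
  '
}

# Usage with the commands above (needs EntrezDirect and network access):
# esearch -db sra -query ERS179576 | efetch -format runinfo | runinfo_summary
```

Combined with the BioSample descriptions above (SAMEA1531956 is NA12877, SAMEA1531955 is NA12878), this gives a compact run-to-individual table.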