Is there a way to know sequences length when querying ENA programmatically?
20 months ago
Pavel Senin ♦ 1.9k
Los Alamos, NM

Hi folks:

I am trying to get raw RNA-Seq data from ENA and wonder how to find out the read length in those FASTQ files without downloading them.

In designing the query I followed their guide and used "nominal_length>XXX" in the query, but this filter doesn't seem to work...

Thanks!

RNA-Seq ENA • 1.4k views
16 months ago
Renesh ♦ 1.6k
United States

Get the total number of bases and the number of reads from ENA, then do the calculation below:

Read length = total bases / number of reads

That works, but you also need to divide by 2 for paired-end libraries :)!
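
The calculation above, including the paired-end correction, could be sketched in Python like this. It is only a minimal sketch: the function name is made up for illustration, and it assumes the two counts come from ENA's run-level metadata (I believe the fields are called base_count and read_count, but check that against the current API) and that the read count for a paired-end run counts pairs rather than individual mates. The result is a mean, so trimmed or variable-length reads will not show up as such.

```python
def mean_read_length(base_count, read_count, paired=False):
    """Estimate mean read length from run-level totals.

    base_count: total number of bases reported for the run
    read_count: number of reads (assumed to count pairs when paired=True)
    paired:     True for paired-end libraries, so each pair holds two mates
    """
    if read_count <= 0:
        raise ValueError("read_count must be positive")
    mates_per_read = 2 if paired else 1
    return base_count / (read_count * mates_per_read)


# Made-up numbers, just to show the call.
single = mean_read_length(1_000_000, 10_000)
paired = mean_read_length(2_000_000, 10_000, paired=True)
print(single, paired)
```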

17 months ago
Chris S. • 290
United States

The Paired nominal length field does not seem to work in the ENA query builder, or at least it doesn't autofill into the query box, but you can add it to the URL string. For example, in the query builder, select domain = read and Library layout = Paired, and optionally a taxon name like Bacteria; you then get this query:

tax_eq(2) AND library_layout="PAIRED"

Hit search, then add the nominal length condition to the query in the URL, which gives this query:

tax_eq(2) AND library_layout="PAIRED" AND nominal_length>300
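
If you want to do the same thing from a script rather than by editing the URL by hand, the query has to be URL-encoded (spaces, quotes and ">" are not safe in a URL). Here is a rough sketch using only the Python standard library. The base URL is a placeholder that you would replace with the real ENA search address, the function names are hypothetical, and the parameter name "query" is my assumption about how the search URL carries the query string.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, taxon_id, min_nominal_length, layout="PAIRED"):
    """Append a URL-encoded ENA-style query to base_url.

    base_url is supplied by the caller; the "query" parameter name is an
    assumption, not something confirmed by the thread.
    """
    query = (
        f'tax_eq({taxon_id}) AND library_layout="{layout}" '
        f"AND nominal_length>{min_nominal_length}"
    )
    return f"{base_url}?{urlencode({'query': query})}"


def extract_query(url):
    """Decode the query back out of the URL, handy for checking the encoding."""
    return parse_qs(urlsplit(url).query)["query"][0]


# Placeholder base URL, not the real ENA endpoint.
url = build_search_url("https://example.org/search", 2, 300)
print(url)
print(extract_query(url))
```

Decoding the URL again with extract_query is a cheap way to confirm that the encoded string still carries exactly the query shown above.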

The nominal_length is just a filter, so you may need to check the read experiment XML for the actual length and standard deviation.

UPDATE: add &result=read_experiment&display=xml&download=txt to the URL and parse NOMINAL_LENGTH from the summary XML.
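
Parsing NOMINAL_LENGTH out of the XML could look roughly like the sketch below, using Python's built-in ElementTree. The sample document is made up (the accessions are not real), and its layout is my understanding of the experiment XML, where NOMINAL_LENGTH and NOMINAL_SDEV are attributes of the PAIRED element inside LIBRARY_LAYOUT; treat that as an assumption and check it against what ENA actually returns. Also keep in mind that, as far as I know, the nominal length describes the library fragment (insert) size rather than the length of the individual reads.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout of the experiment XML.
SAMPLE_XML = """<EXPERIMENT_SET>
  <EXPERIMENT accession="ERX0000001">
    <DESIGN><LIBRARY_DESCRIPTOR><LIBRARY_LAYOUT>
      <PAIRED NOMINAL_LENGTH="300" NOMINAL_SDEV="25"/>
    </LIBRARY_LAYOUT></LIBRARY_DESCRIPTOR></DESIGN>
  </EXPERIMENT>
  <EXPERIMENT accession="ERX0000002">
    <DESIGN><LIBRARY_DESCRIPTOR><LIBRARY_LAYOUT>
      <PAIRED/>
    </LIBRARY_LAYOUT></LIBRARY_DESCRIPTOR></DESIGN>
  </EXPERIMENT>
</EXPERIMENT_SET>"""


def _to_float(value):
    """Convert an attribute value to float, keeping None for missing values."""
    return float(value) if value is not None else None


def nominal_lengths(xml_text):
    """Map experiment accession -> (nominal length, nominal sdev)."""
    root = ET.fromstring(xml_text)
    result = {}
    for experiment in root.iter("EXPERIMENT"):
        paired = experiment.find(".//LIBRARY_LAYOUT/PAIRED")
        if paired is None:  # single-end or layout missing: nothing to report
            continue
        result[experiment.get("accession")] = (
            _to_float(paired.get("NOMINAL_LENGTH")),
            _to_float(paired.get("NOMINAL_SDEV")),
        )
    return result


lengths = nominal_lengths(SAMPLE_XML)
print(lengths)
```

Experiments whose PAIRED element has no attributes come back as (None, None), so you can tell "not submitted" apart from a real value.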


Also accepted. XML is a better way to get the data, though parsing and debugging it is a pain.
