How To Map The Fastq With Paired-End Seq Combined In One File
3
2
Entering edit mode
11.3 years ago
camelbbs ▴ 710

HI,

I found some rnaseq in encode were sequenced in paired-end but were combined in just one file like this:

@ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
GTTAGGGAGGTTATGGAGGTTAGGGAGGTTATGGAGGTTATGGAGGTTAGCCTCGGTCTCCACCATAGCCTCCACCTCGGTCTCCTCCATAGCCTCCTCG
+ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
HHHHHHHHHHHHHHHHHHHDFBFFFEHHHHDGGG?HHDHAGFG7GC9C8:HHHHHHHHHHHHHHHIHHHHHHHHHHHHHGHHHHHHHHFHHHHHHHHHHD

They sequenced as 2X50bp but combine both end in one file. How to handle this type of data. Should I treat them as single end in mapping?

Thanks a lot !

rna-seq • 9.0k views
ADD COMMENT
0
Entering edit mode

you are only showing a single record above

ADD REPLY
0
Entering edit mode

This is the only record. The description is: 50bpPEmRNASeqFCAs51sequence, 2x50 paired end mRNA-seq READ1, ~ 0.5% phiX DNA spiked in, Performer: ILLUMINA-CA

ADD REPLY
0
Entering edit mode

I'd double-check your metadata. It would be very hard to create a single fastq file from two paired-end files. If these are publicly available data, can you share the links to the metadata and the files that you believe are incorrect?

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

One can obtain fastq files directly from EBI ENA. The accession number for the sequencing data is ERX011212. A search for that at the EBI ENA website will lead to http://www.ebi.ac.uk/ena/data/view/ERX011212 where you can download the fastq files directly. Note that SRA and ENA mirror one another, so fastq files for SRA can almost always be obtained through going to EBI.

ADD REPLY
0
Entering edit mode

thanks a lot...

ADD REPLY
5
Entering edit mode
11.3 years ago
lh3 33k
awk 'NR%2{print>"read1.fq";print>"read2.fq"}NR%2==0{print substr($0,1,50)>"read1.fq";print substr($0,51)>"read2.fq"}' in.fq
ADD COMMENT
4
Entering edit mode
11.3 years ago

This happens a lot in SRA, you should split the file into two fastq contaning 50bp each. Something like this:

reads1.fastq
@ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=50
GTTAGGGAGGTTATGGAGGTTAGGGAGGTTATGGAGGTTATGGAGGTTAG
+ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
HHHHHHHHHHHHHHHHHHHDFBFFFEHHHHDGGG?HHDHAGFG7GC9C8:

reads2.fastq
@ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=50
CCTCGGTCTCCACCATAGCCTCCACCTCGGTCTCCTCCATAGCCTCCTCG
+ERR030882.73513047 HWI-BRUNOP16X_0001:3:67:4997:91733#0 length=100
HHHHHHHHHHHHHHHIHHHHHHHHHHHHHGHHHHHHHHFHHHHHHHHHHD

Reads this http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc&f=fastq-dump

And try to use one of this commands to extract 2 fastq files from the SRA file:

fastq-dump --split-files SRR443885.sra

or

fastq-dump --split-3 SRR306633.lite.sra
ADD COMMENT
1
Entering edit mode
11.3 years ago
Geparada ★ 1.5k

I'm currently working with paired-end reads that are merged in the same way ...but each merged read is 155 nt length, so I don't have to split the reads a the half.

@SRR594435.4067 HWI-ST333_0042:5:1:16857:2275 length=155
GAACGTTGTCTTGCATTCTGTGGCAACTGCTTTAGCAAGGAGGGTCTTTCCAGTGCCAGGTGGGCCAACCATCAGCACAAAAAAGTTGCTTCAGGAAGCAGTGGTGTTACCAATGTTGATGCCAGAATTCTTTAAGGGCATTAGGAGACCCTGGA
+SRR594435.4067 HWI-ST333_0042:5:1:16857:2275 length=155
]Y]a\___b_\cc\b```^L`ZXXXZQO_UUVVVIOUVITSRMU]\\WaUYXO`WWX^BBBBBBBBBBBBBBBBBBBBBBYT^YLbbb\`b^Yc\bZU]\VOQQNQ_\K_ZXZZS_JZ`XUIOVLZ`^ZVVUVY^BBBBBBBBBBBBBBBBBBBB

I figured out that the 80 first nt came from the forward and the other 75 from the reverse, so I couldn't use fastq-dump. In the morning I wrote a code to split the reads at any position... may be useful if anyone have the same problem.

import sys
from Bio import SeqIO
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Alphabet import generic_dna
from Bio.SeqRecord import SeqRecord


def main (fastq):
    """ Splits the reads at N position. It generate two files rd1 and rd2 """

    N = 80    

    f = open(fastq)
    rd1 = open('rd1', 'w')
    rd2 = open('rd2', 'w')

    for read in SeqIO.parse(f, "fastq"):

        rd1_seq = read.seq[:N]
        rd2_seq = read.seq[N:]

        rd1_id = read.id + "_1"
        rd2_id = read.id + "_2"

        rd1_Q = read.letter_annotations["phred_quality"][:N]
        rd2_Q = read.letter_annotations["phred_quality"][N:]

        reads_rd1 = SeqRecord( rd1_seq, rd1_id, description = "" )
        reads_rd1.letter_annotations["phred_quality"] = rd1_Q    

        reads_rd2 = SeqRecord( rd2_seq, rd2_id, description = "" )
        reads_rd2.letter_annotations["phred_quality"] = rd2_Q    

        rd1.write(reads_rd1.format("fastq"))
        rd2.write(reads_rd2.format("fastq"))


if __name__ == '__main__':
    main(sys.argv[1])
ADD COMMENT

Login before adding your answer.

Traffic: 3274 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6