Question

How to get proportion of specific sequence in fastq file

0

Entering edit mode

5.2 years ago

lkianmehr ▴ 100

Hello,

Do you have any idea about How to get proportion of specific sequence (defined as a separate fastq file=

@1
TAACCCTAACCCTAACCCTAACCC
+
JJJJJJJJJJJJJJJJJJJJJJJJ
@2
TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ

) in several paired-read (fastq) or Bam file of RNA-seq

thanks in advance

specific sequence fastq RNA-Seq • 1.3k views

ADD COMMENT • link updated 5.2 years ago by shenwei356 8.4k • written 5.2 years ago by lkianmehr ▴ 100

0

Entering edit mode

Do you mean getting the first half of every read and save them as a separate file? Did you try anything yet?

ADD REPLY • link 5.2 years ago by ATpoint 81k

0

Entering edit mode

I just want to know how many of these specific sequence are there in these paired-read sequence to get proportion of that between all? I have tried by

 bbduk.sh in1=D1_L001_1.fastq.gz in2=D1_L001_2 out=D1.fq literal=TAACCCTAACCCTAACCCTAACCC k=24 mm=f int=f

but this command just extract all reads contain the sequence I think. I also have used Salmon to make an Index of fasta of this sequence and quantify all reads on, but was not good trick

ADD REPLY • link updated 5.2 years ago by lakhujanivijay 5.8k • written 5.2 years ago by lkianmehr ▴ 100

1

Entering edit mode

If TAACCCTAACCCTAACCCTAACCC is the full length sequence you are trying to identify you should be able to get the number that contain the sequence by just running the command and not providing an out= file name. This will do the operation and generate relevant statistics file. Sounds like that is all you are interested in. There should be a number for total sequences processed.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Entering edit mode

I have got the result like below, what does it mean? contaminants determine the number of reads including that specific sequence or something else?

Input:                      65975862 reads      6554014910 bases.
Contaminants:               195232 reads (0.30%)    19519262 bases (0.30%)
Total Removed:              1040136 reads (1.58%)   61775988 bases (0.94%)
Result:                     64935726 reads (98.42%)     6492238922 bases (99.06%)

Time:                           133.851 seconds.
Reads Processed:      65975k    492.91k reads/sec
Bases Processed:       6554m    48.97m bases/sec

ADD REPLY • link updated 5.2 years ago by GenoMax 141k • written 5.2 years ago by lkianmehr ▴ 100

1

Entering edit mode

Is that the output from the command you included above? BBDuk by default works in filter mode. Take a look at this guide to get additional details (and also look at the in-line help).

At first glance Total Removed: 1040136 reads (1.58%) should be the reads that have the sequence you specified in the literal directive. Since you are running this in PE mode either/both reads may have the sequence. If you want to be specific you could run the files individually to get a more precise number.

ADD REPLY • link 5.2 years ago by GenoMax 141k