Union of unaligned fastq reads
0
1
Entering edit mode
6.8 years ago
Jeffin Rockey ★ 1.3k

Hi,

Raw data is a paired end fastq file.

Aligned it with Genome-1 (using STAR) and got unaligned R1 and Unaligned R2.

Also aligned it with Genome-2 (using STAR) and got unaligned R1 and Unaligned R2.

Please advise what would be the best method/tool to obtain the 'union' fastq of the unaligned reads from the unaligned of both the genome alignments.

Jeffin

RNA-Seq • 1.6k views
ADD COMMENT
1
Entering edit mode

Do you really want the union or rather the intersection?

ADD REPLY
1
Entering edit mode

Union itself is the requirement

ADD REPLY
1
Entering edit mode

Creating the union would be simply combining the unaligned files together? Just need to avoid duplicates.

ADD REPLY
1
Entering edit mode

Assuming unaligned R1/R2 are .bamfiles:

samtools view unaligned.R1.genome1.bam | cut -f 1 > unaligned.R1.txt
samtools view unaligned.R1.genome2.bam | cut -f 1 >> unaligned.R1.txt
grep -A 3 -F -f <(sort -u unaligned.R1.txt) original.R1.fq | grep -v "\-\-" > union.unaligned.R1.fq

Analogously for .R2

ADD REPLY
1
Entering edit mode

Thanks.

Could you please provide some detail on whats happening in the grep line?

ADD REPLY
1
Entering edit mode
  1. We fgrep (grep -F, i.e. take patterns from file) the original fastq file for all the read ids of the union of unaligned reads.
  2. -f takes the pattern file, which here (because I'm lazy) are given as a process substitution: <(sort -u unaligned.R1.txt) provides the output of the sort operation as a 'file' (the shell pretends this is a file). Alternatively, one could have done the sort -u into another file and used that filename as argument to the -f.
  3. grep -A 3 returns the line with the match + the following 3 lines. However, if there are multiple hits, then they will be separated by --, which is why the output of the first grep is piped into a grep -v, that ignores the -- lines.
ADD REPLY
1
Entering edit mode

Thanks a lot @cschu181 . Good to learn about the -A functionality in grep :)

ADD REPLY
1
Entering edit mode

Hi, The best way to avoid duplicate fastq entries is the aspect what I am doubtful about.

If I write a small script it would be easily doable.But I wanted to know whether there is better method to combine while keeping duplicates away

ADD REPLY

Login before adding your answer.

Traffic: 2497 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6