Question

Extracting counts of reads into a matrix

0

Entering edit mode

6.1 years ago

rmash ▴ 20

I have a txt file with sequences

1   TACCCTGTAGAACCGAATTTGT  miRNA   mmu-mir-10b PM
2   GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   PM
3   TACCCTGTAGATCCGAATTTGT  miRNA   mmu-mir-10a PM
4   GCATTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA    Mus_musculus_tRNA-Gly-GCC-2-2   IM
5   ACCCTGTAGAACCGAATTTGT   other   other   NA
6   TACCCTGTAGAACCGAATTTG   other   other   NA
7   GCATTGGTTCAGTGGTAGAATTCTCGCCT   tRNA    Mus_musculus_tRNA-Gly-GCC-2-7   IM
8   GCATTTGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   IM
9   TACCCTGTAGAACCGAATTTGTG miRNA   mmu-mir-10b PM
10  GGTGAATATAGTTTACAAAAAACATTAGACTGTGAATC  tRNA    tRNA-His    IM

I'd like a count matrix based on the 3rd value in each line such that I have something like. What's the best way to do this?

mmu-mir-10b 2
tRNA-His 1
other 2

etc

counts reads unix grep • 1.7k views

ADD COMMENT • link updated 6.1 years ago by cpad0112 21k • written 6.1 years ago by rmash ▴ 20

0

Entering edit mode

That means you have no repeated reads. I don't see any repeated read in given example.

ADD REPLY • link updated 6.1 years ago by GouthamAtla 12k • written 6.1 years ago by venu 7.1k

0

Entering edit mode

The command did what it supposed to do. You have repeated substrings, but not repeated strings. For example, "ACCCTGTAGATCCGAATTTGT" is repeated 4 times in other sequences but as a substring.

ADD REPLY • link 6.1 years ago by GouthamAtla 12k

0

Entering edit mode

Yes, I was doing something silly, but I've fixed it and there should now be repeats, as the 4th item in each line, how would I go about making a count matrix from this? Sorry, i'm a novice

ADD REPLY • link 6.1 years ago by rmash ▴ 20

0

Entering edit mode

Did you derive this strange file from your alignments in SAM/BAM format? Since you are dealing with miRNA (which should have no gaps) why not just count the instances of the miRNA a read is aligned to (in third field of your alignment file) to get the count matrix.

You are interested in miRNA counts and not read counts, is that correct?

ADD REPLY • link 6.1 years ago by GenoMax 141k

0

Entering edit mode

yes thats what id like to do but dont know how to in unix

ADD REPLY • link 6.1 years ago by rmash ▴ 20

0

Entering edit mode

thats correct, have edited main post to make it clearer

ADD REPLY • link 6.1 years ago by rmash ▴ 20

score 0 · Answer 1 · 2018-03-25

0

Entering edit mode

6.1 years ago

rmash ▴ 20

This seems to work

awk '{seen[$4]++} END{for(x in seen) print x, seen[x]}' test.txt > test_counts.txt

Is there a better option?

ADD COMMENT • link 6.1 years ago by rmash ▴ 20

0

Entering edit mode

Better in what terms? If it gives you the right answer then this is fine.

Just to be safe I suggest that you do something similar on your original SAM file so there are less chance of errors.

ADD REPLY • link 6.1 years ago by GenoMax 141k

score 0 · Answer 2 · 2018-03-25

0

Entering edit mode

6.1 years ago

cpad0112 21k

 $ datamash -s -g 4 count 4 < test.list 
Mus_musculus_tRNA-Gly-GCC-2-2   1
Mus_musculus_tRNA-Gly-GCC-2-7   1
Mus_musculus_tRNA-Gly-GCC-4-1   2
mmu-mir-10a 1
mmu-mir-10b 2
other   2
tRNA-His    1

test.list is data from OP and first column is serial numbers from 1 to 10 from OP.

ADD COMMENT • link 6.1 years ago by cpad0112 21k