Question: Extracting counts of reads into a matrix
0
Entering edit mode

I have a txt file with sequences

1   TACCCTGTAGAACCGAATTTGT  miRNA   mmu-mir-10b PM
2   GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   PM
3   TACCCTGTAGATCCGAATTTGT  miRNA   mmu-mir-10a PM
4   GCATTGTGGTTCAGTGGTAGAATTCTCGCCT tRNA    Mus_musculus_tRNA-Gly-GCC-2-2   IM
5   ACCCTGTAGAACCGAATTTGT   other   other   NA
6   TACCCTGTAGAACCGAATTTG   other   other   NA
7   GCATTGGTTCAGTGGTAGAATTCTCGCCT   tRNA    Mus_musculus_tRNA-Gly-GCC-2-7   IM
8   GCATTTGTGGTTCAGTGGTAGAATTCTCGCCT    tRNA    Mus_musculus_tRNA-Gly-GCC-4-1   IM
9   TACCCTGTAGAACCGAATTTGTG miRNA   mmu-mir-10b PM
10  GGTGAATATAGTTTACAAAAAACATTAGACTGTGAATC  tRNA    tRNA-His    IM

I'd like a count matrix based on the 3rd value in each line such that I have something like. What's the best way to do this?

mmu-mir-10b 2
tRNA-His 1
other 2

etc

ADD COMMENTlink 22 months ago rmash • 0 • updated 22 months ago cpad0112 11k
Entering edit mode
0

That means you have no repeated reads. I don't see any repeated read in given example.

ADD REPLYlink 22 months ago
venu
6.2k
• updated 22 months ago
geek_y
9.7k
Entering edit mode
0

The command did what it supposed to do. You have repeated substrings, but not repeated strings. For example, "ACCCTGTAGATCCGAATTTGT" is repeated 4 times in other sequences but as a substring.

ADD REPLYlink 22 months ago
geek_y
9.7k
Entering edit mode
0

Yes, I was doing something silly, but I've fixed it and there should now be repeats, as the 4th item in each line, how would I go about making a count matrix from this? Sorry, i'm a novice

ADD REPLYlink 22 months ago
rmash
• 0
Entering edit mode
0

Did you derive this strange file from your alignments in SAM/BAM format? Since you are dealing with miRNA (which should have no gaps) why not just count the instances of the miRNA a read is aligned to (in third field of your alignment file) to get the count matrix.

You are interested in miRNA counts and not read counts, is that correct?

ADD REPLYlink 22 months ago
genomax
68k
Entering edit mode
0

yes thats what id like to do but dont know how to in unix

ADD REPLYlink 22 months ago
rmash
• 0
Entering edit mode
0

thats correct, have edited main post to make it clearer

ADD REPLYlink 22 months ago
rmash
• 0
0
Entering edit mode

This seems to work

awk '{seen[$4]++} END{for(x in seen) print x, seen[x]}' test.txt > test_counts.txt

Is there a better option?

ADD COMMENTlink 22 months ago rmash • 0
Entering edit mode
0

Better in what terms? If it gives you the right answer then this is fine.

Just to be safe I suggest that you do something similar on your original SAM file so there are less chance of errors.

ADD REPLYlink 22 months ago
genomax
68k
0
Entering edit mode
 $ datamash -s -g 4 count 4 < test.list 
Mus_musculus_tRNA-Gly-GCC-2-2   1
Mus_musculus_tRNA-Gly-GCC-2-7   1
Mus_musculus_tRNA-Gly-GCC-4-1   2
mmu-mir-10a 1
mmu-mir-10b 2
other   2
tRNA-His    1

test.list is data from OP and first column is serial numbers from 1 to 10 from OP.

ADD COMMENTlink 22 months ago cpad0112 11k

Login before adding your answer.

Powered by the version 1.8