Count Hits to Each Unique Domain in Hmmscan Results w/ Python?
9.9 years ago

I'm trying to find the top domains that my data matched using hmmscan. I'm working from the domain table output that hmmscan produces (I also have the Pfam file). I'd like a script that counts the matches to each unique domain and then sorts that list, so that I can extract the top 15-20 domains my data matched. Thoughts?

9.9 years ago
5heikki 11k

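Given that the domain info is in column X (in hmmscan's --domtblout output, column 1 is the target name and column 2 is the target accession). The table is padded with spaces rather than tabs, so awk is safer than cut -f here. Skip the # comment lines, sort in descending order so the top hits come first, and replace X with the column number:

grep -v '^#' yourOutputFile | awk '{print $X}' | sort | uniq -c | sort -k1,1nr | head -n 20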
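
Since the question asks for Python, the same count-and-sort idea can be sketched with only the standard library. This is a minimal sketch, not a definitive implementation: the function name `count_domains` and the sample lines are made up for illustration, and it assumes whitespace-separated columns with `#` comment lines, as in hmmscan's --domtblout output.

```python
from collections import Counter

def count_domains(lines, column=0):
    """Count occurrences of the value in a whitespace-separated column.

    `column` is 0-based; in hmmscan --domtblout output, 0 is assumed to be
    the target (domain) name and 1 the target accession.
    """
    counts = Counter()
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        counts[fields[column]] += 1
    return counts

# Tiny made-up sample standing in for a real domtblout file.
sample = [
    "# target name  accession  tlen  query name ...",
    "ABC_tran  PF00005.25  137  seq1 ...",
    "AAA_21    PF13304.4   303  seq1 ...",
    "ABC_tran  PF00005.25  137  seq2 ...",
]

# most_common(20) returns up to 20 (value, count) pairs, highest count first.
top = count_domains(sample, column=1).most_common(20)
print(top)
```

For a real file, pass the open file handle instead of the sample list, e.g. `with open("yourOutputFile") as fh: top = count_domains(fh, column=1).most_common(20)`.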
6.4 years ago
John • 0

One option is to load your hmmscan table results into a pandas.DataFrame; counting the domains is then easy with the value_counts() method, which returns the counts sorted from most to least frequent:

# Here, `hmm_tbl` is a pandas.DataFrame with your hmmscan results.
# This dataframe has a column `accession_target`, which is the accession of the result in the table.
# E.g., the dataframe might look something like this (truncated for readability):
#     acc  accession_query accession_target  ali_coord_from  env_coord_from
# 0  0.91              NaN       PF00005.25              10               6
# 1  0.89              NaN        PF13304.4              33              21
hmm_tbl.accession_target.value_counts()
# PF07719.15    4668
# PF00005.25    4402
# PF13304.4     3626
# PF13432.4     3601
# PF13428.4     3513
# PF00515.26    3494
# ...

Link: pandas.Series.value_counts — pandas 0.21.0 documentation
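
The snippet above assumes `hmm_tbl` already exists. One way to get there is to parse the domain table into a list of records first. The sketch below is an assumption-laden illustration rather than the method used in this answer: `parse_domtblout` and the column names in `DOMTBL_COLUMNS` are hypothetical (chosen to line up with names like `accession_target` shown above), and it assumes the usual --domtblout layout of 22 whitespace-separated columns followed by a free-text description.

```python
# Hypothetical column names for the assumed 22 fixed columns + description.
DOMTBL_COLUMNS = [
    "target_name", "accession_target", "tlen",
    "query_name", "accession_query", "qlen",
    "evalue", "score", "bias",
    "domain_index", "domain_count",
    "c_evalue", "i_evalue", "domain_score", "domain_bias",
    "hmm_from", "hmm_to",
    "ali_coord_from", "ali_coord_to",
    "env_coord_from", "env_coord_to",
    "acc", "description",
]

def parse_domtblout(lines):
    """Turn domtblout lines into a list of dicts (all values kept as strings)."""
    records = []
    for line in lines:
        # Skip comment/header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        # Limit the number of splits so that a description containing
        # spaces stays together in the final field.
        fields = line.strip().split(maxsplit=len(DOMTBL_COLUMNS) - 1)
        records.append(dict(zip(DOMTBL_COLUMNS, fields)))
    return records

# One made-up row standing in for a real file.
sample = [
    "# target name  accession  tlen  query name ...",
    "ABC_tran PF00005.25 137 seq1 - 250 1.2e-30 105.1 0.0 1 1 "
    "3.4e-34 1.2e-30 105.1 0.0 1 137 10 150 6 155 0.91 ABC transporter\n",
]
records = parse_domtblout(sample)
print(records[0]["accession_target"], records[0]["description"])
```

With pandas installed, `hmm_tbl = pandas.DataFrame(parse_domtblout(open(path)))` should then give a frame on which `hmm_tbl.accession_target.value_counts()` works. Every value is still a string at that point, so numeric columns would need converting, for example with `pandas.to_numeric`.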
