Biopython hmmscan parser
1
0
Entering edit mode
8.5 years ago

Hello everyone,

I have a hmmscan output in hmmscan3-domtab format with more then 10000 queries.

I want to filter each profile hit based on hit_coverage (hmm profile coverage).

I was scanning queryresult--> Hits -->Hsps --> HSPfragment this iterations and calculating hit_coverage then writing to another file (filtered file) using using SearchIO in biopython write method but while writing I found that all the hits were going to filtered file.

Below is the code I have used:

for qresult in SearchIO.parse(file_path,'hmmscan3-domtab'):
       pass
       for each_hit in qresult:
           pass
           for each_hsp in each_hit:
               pass
               if re.match("^gfam",each_hit.id) and (100*(each_hsp.hit_end  - each_hsp.hit_start)/each_hit.seq_len) >= 75:
                   SearchIO.write(qresult,filtered_file_handler,'hmmscan3-domtab')
               elif not re.match("^gfam",each_hit.id) and (100*(each_hsp.hit_end - each_hsp.hit_start)/each_hit.seq_len) >= 50:
                    SearchIO.write(qresult,filtered_file_handler,'hmmscan3-domtab')

Any suggestions for sending only filtered hits to the filtered file.

PS - I have noticed that I am writing qresult. I need some suggestions for refining on this.

Thanks,
Vijay N

hmmscan biopython SearchIO • 3.2k views
ADD COMMENT
0
Entering edit mode
7.0 years ago
O.rka ▴ 710

This is the one I wrote. I need to do some error handling for when there are no hits b/c it will break. I put stuff like this in my Jupyter profile and it speeds stuff up... a lot.

def read_hmmscan(path, set_index="query_name", sort_values="sequence|_|e-value"):
    # Build pd.DataFrame
    df_tmp = pd.read_table(path, header=None, index_col=0)
    mask_idx = df_tmp.index.map(lambda x: x.startswith("#") == False)
    tmp_list = df_tmp.loc[mask_idx,:].index.map(lambda x: [y for y in x.split(" ") if (len(y) > 0) and (y != "-")]).tolist()
    data = list()
    for line in tmp_list:
        data.append(pd.Series(line))
    df_hmmscan = pd.DataFrame(data)
    # Labels
    id_cols = ["target_name", "query_name"]
    sequence_cols = ["sequence|_|e-value", "sequence|_|score", "sequence|_|bias"]
    domain_cols = ["domain|_|e-value", "domain|_|score", "domain|_|bias"]
    misc_cols = ["exp", "reg", "clu", "ov", "env", "dom", "rep", "inc"]
    df_hmmscan.columns  = id_cols + sequence_cols + domain_cols + misc_cols
    # Force types
    df_hmmscan.loc[:,sequence_cols + domain_cols] = df_hmmscan.loc[:,sequence_cols + domain_cols].astype(float)
    df_hmmscan.loc[:,"exp"] = df_hmmscan.loc[:,"exp"].astype(float)
    df_hmmscan.loc[:,misc_cols[1:]] = df_hmmscan.loc[:,misc_cols[1:]].astype(int)
    # Reformat structure
    if set_index:
        df_hmmscan = df_hmmscan.set_index(set_index)
    if sort_values:
        df_hmmscan = df_hmmscan.sort_values(sort_values, ascending=True)
    return df_hmmscan
ADD COMMENT

Login before adding your answer.

Traffic: 2345 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6