Question

Solved: use pandas dataframe column result to slice data in other columns


4.9 years ago

toth.joe ▴ 30

I want a workflow that takes raw sequence reads as a multi-entry FASTA file, then processes them with HMMER and regex patterns to yield trimmed gene DNA and protein sequences.

I am able to assemble a dataframe with the raw DNA, the upstream trimmed DNA and protein. For the downstream trimming, I need to use a regex pattern match to find the downstream trim location. The problem is my 'endloc' variable is a float and can't be used for slicing the other dataframe columns.

The first block of code below defines a function that returns the end location. No matter what I try, the number ends up as a float in the dataframe. After it are the two unsuccessful approaches I tried for slicing the DNA and the protein. Another problem is that I can't convert the DNA position into a protein position by dividing by 3 here.

def regex_filter(val):
    try:
        return int(3*re.search(endRegex, str(val)).start())
    except Exception:
        pass

#find Vregion protein sequence end location and write new column
df_result['endloc'] = df_result['Vstart_protein'].apply(regex_filter).dropna().astype(int)

#trim Vregion protein sequence and write new column
#df_result['protein'] = df_result.apply(lambda x: x['Vstart_protein'][x:['endloc']/3], 1)

#trim Vregion DNA sequence and write new column
df_result['DNA'] = df_result.Vstart.str[:['endloc']]

How can I get the result of the regex pattern calculation into a dataframe column in a way that I can slice other columns with untrimmed DNA and protein sequence?

After looking at my code again, I came up with some solutions:

def regex_filter(val):
    try:
        return int(3*re.search(endRegex, str(val)).start())
    except Exception:
        #no regex match: return 1 as a marker instead of None
        return 1

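The function now returns 1 instead of None when there is no match. pandas stores None as NaN, and it converts a column of ints to float as soon as the column contains a NaN. Real results are always multiples of 3, so 1 can never be a genuine match position.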
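To see why the column kept coming back as float, here is a minimal sketch. The motif `WGQG` and the toy protein strings are made up for illustration, since the post does not show the real `endRegex` or data. It shows that `dropna().astype(int)` does produce integers, but assigning the shorter result back to the dataframe realigns it on the index and reintroduces NaN. The last line shows one possible alternative, the pandas nullable integer dtype `Int64`, which was not part of the original solution.

```python
import re

import pandas as pd

# Hypothetical end-of-region motif; the post does not show the real endRegex.
end_regex = re.compile(r"WGQG")


def regex_filter(val):
    """Return the DNA offset (3 x protein offset) of the motif, or None if absent."""
    match = end_regex.search(str(val))
    return 3 * match.start() if match else None


# Toy data (assumption): the middle row has no motif.
df = pd.DataFrame({"Vstart_protein": ["QVQLWGQGTL", "QVQLVESGGG", "EVWGQGAA"]})

# dropna() + astype(int) yields integers, but only for the rows that matched.
matched = df["Vstart_protein"].apply(regex_filter).dropna().astype(int)

# Assigning back aligns on the index; row 1 has no value, so it becomes NaN
# and the whole column is stored as float again.
df["endloc"] = matched
float_again = pd.api.types.is_float_dtype(df["endloc"])

# Possible alternative: the nullable integer dtype keeps integers next to missing values.
nullable = df["Vstart_protein"].apply(regex_filter).astype("Int64")
```

Rows with a missing `endloc` would still need to be dropped before slicing, because a missing value cannot be used as a slice bound.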

#remove rows that didn't have a REGEX match
df_result = df_result[df_result['endloc'] != 1]
df_result = df_result.astype({'endloc': int})

This removes the rows where endloc is 1 (no regex match) and then casts the column to int.

There was a typo in my lambda function. Here is the correct syntax:

#trim Vregion DNA sequence and write new column
df_result['DNA'] = df_result.apply(lambda x: x['Vstart'][:x['endloc']], axis=1)
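For the protein column, which the question also asked about, the same pattern should work if the DNA position is converted with integer division (`//`) rather than `/`, since `/` always returns a float in Python 3. The sketch below runs the whole trimming step on toy data. The motif `WGQG` and both sequences are hypothetical, and it drops unmatched rows with `dropna` instead of the marker value 1 used above, which is a variation and not the original solution.

```python
import re

import pandas as pd

# Hypothetical end-of-region motif; the post does not show the real endRegex.
end_regex = re.compile(r"WGQG")


def regex_filter(val):
    """Return the DNA offset (3 x protein offset) of the motif, or None if absent."""
    match = end_regex.search(str(val))
    return 3 * match.start() if match else None


# Toy data (assumption): the second row has no motif and should be dropped.
df_result = pd.DataFrame({
    "Vstart": ["CAGGTGCAGCTGTGGGGCCAGGGCACCCTG", "CAGGTGCAGCTGGTGGAGTCTGGGGGAGGC"],
    "Vstart_protein": ["QVQLWGQGTL", "QVQLVESGGG"],
})

# Compute the end location, drop rows without a match, then cast to int.
df_result["endloc"] = df_result["Vstart_protein"].apply(regex_filter)
df_result = df_result.dropna(subset=["endloc"]).astype({"endloc": int})

# Trim the DNA at endloc and the protein at endloc // 3 (integer division).
df_result["DNA"] = df_result.apply(lambda row: row["Vstart"][: row["endloc"]], axis=1)
df_result["protein"] = df_result.apply(
    lambda row: row["Vstart_protein"][: row["endloc"] // 3], axis=1
)
```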

python pandas • 1.9k views

ADD COMMENT • link updated 4.9 years ago by Biostar 20 • written 4.9 years ago by toth.joe ▴ 30


Please post your answer as an answer, see below - Add your answer

ADD REPLY • link 4.9 years ago by zx8754 11k