Solved: use a pandas dataframe column result to slice data in other columns
4.9 years ago
toth.joe ▴ 30

I want a workflow that takes raw sequence reads as a multi-entry FASTA file, then processes them with HMMER and regex patterns to yield trimmed gene DNA and protein sequences.

I am able to assemble a dataframe with the raw DNA, the upstream trimmed DNA and protein. For the downstream trimming, I need to use a regex pattern match to find the downstream trim location. The problem is my 'endloc' variable is a float and can't be used for slicing the other dataframe columns.

The top code here runs a function that gives an end location value. No matter what I try the number is a float in the dataframe. Below are the two unsuccessful approaches I tried to slice as DNA and protein. Another problem is that I can't convert the DNA count into protein count by division here.

def regex_filter(val):
    try:
        return int(3*re.search(endRegex, str(val)).start())
    except Exception:
        pass

#find Vregion protein sequence end location and write new column
df_result['endloc'] = df_result['Vstart_protein'].apply(regex_filter).dropna().astype(int)

#trim Vregion protein sequence and write new column
#df_result['protein'] = df_result.apply(lambda x: x['Vstart_protein'][x:['endloc']/3], 1)

#trim Vregion DNA sequence and write new column
df_result['DNA'] = df_result.Vstart.str[:['endloc']]

How can I get the result of the regex pattern calculation into a dataframe column in a way that I can slice other columns with untrimmed DNA and protein sequence?

After looking at my code I came up with some solutions

def regex_filter(val):
    try:
        return int(3*re.search(endRegex, str(val)).start())
    except Exception:
        return 1

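The function now returns 1 instead of None when there is no match. pandas stores None as NaN, and it converts a column of ints to float as soon as the column contains a NaN. A real match can never return 1, because the result is always a multiple of 3, so 1 is safe to use as a "no match" marker.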
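The float conversion can be reproduced on a toy Series. This is only a minimal sketch with made-up values. It also shows why `.dropna().astype(int)` did not help when the result was assigned back to a column, and it tries pandas' nullable `Int64` dtype, which is an alternative and not what was used in this post.

```python
import pandas as pd

# Hypothetical end locations; None marks a row with no regex match.
endloc = pd.Series([12, None, 30])
print(endloc.dtype)  # float64, because the missing value is stored as NaN

# dropna().astype(int) gives ints, but assigning the shorter result back
# to a dataframe column realigns on the index and reintroduces NaN.
df = pd.DataFrame({"seq": ["a", "b", "c"]})
df["endloc"] = endloc.dropna().astype(int)
print(df["endloc"].dtype)  # float64 again

# Alternative: the nullable integer dtype keeps ints next to <NA>.
nullable = endloc.astype("Int64")
print(nullable.dtype)  # Int64
```

Note that a row holding `<NA>` still cannot be used as a slice bound, so rows without a match have to be dropped or handled before slicing either way.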

#remove rows that didn't have a REGEX match
df_result = df_result[df_result['endloc'] != 1]
df_result = df_result.astype({'endloc': int})

This removes the rows where 'endloc' is 1 (no regex match) and then casts the column to int.
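A possible variation, sketched here under assumptions, is to skip the marker value, let the function return None, and drop the unmatched rows from the whole dataframe before casting. The `endRegex` pattern and the sequences below are hypothetical and only there to make the example runnable.

```python
import re
import pandas as pd

# Hypothetical pattern and data, for illustration only.
endRegex = re.compile(r"VTVSS")

def regex_filter(val):
    # Return the DNA offset (3 bases per residue) of the match, or None.
    match = endRegex.search(str(val))
    return 3 * match.start() if match else None

df_result = pd.DataFrame({"Vstart_protein": ["QVQLVTVSSAS", "QVQLNOMATCH"]})
df_result["endloc"] = df_result["Vstart_protein"].apply(regex_filter)

# Drop the rows themselves (not just the column values), then cast to int.
df_result = df_result.dropna(subset=["endloc"]).astype({"endloc": int})
print(df_result)
```

Dropping on the dataframe rather than on the column avoids the index realignment that brings NaN back.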

There was also a typo in my lambda function. Here is the correct syntax:

#trim Vregion DNA sequence and write new column
df_result['DNA'] = df_result.apply(lambda x: x['Vstart'][:x['endloc']], 1)
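The same row-wise pattern should also cover the protein column from the question. The sketch below uses toy sequences (not real reads) and assumes 'endloc' is already an int counted in DNA bases. Floor division (`//`) converts it to a residue count and keeps it an int, whereas `/` returns a float that cannot be used as a slice bound.

```python
import pandas as pd

# Toy data: Vstart is DNA, Vstart_protein its translation, endloc in bases.
df_result = pd.DataFrame({
    "Vstart": ["ATGGCCAAATAG", "ATGAAACCCGGG"],
    "Vstart_protein": ["MAK*", "MKPG"],
    "endloc": [9, 6],
})

# DNA is sliced with the base offset directly.
df_result["DNA"] = df_result.apply(lambda row: row["Vstart"][:row["endloc"]], axis=1)

# Floor division (//) keeps the protein offset an int; / would give a float.
df_result["protein"] = df_result.apply(
    lambda row: row["Vstart_protein"][:row["endloc"] // 3], axis=1
)
print(df_result[["DNA", "protein"]])
```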

Please post your answer as an answer, see below - Add your answer
