I want a workflow that takes raw sequence reads as a multi-entry FASTA file, then processes them with HMMER and regex patterns to yield trimmed gene DNA and protein sequences.
I am able to assemble a dataframe with the raw DNA, the upstream trimmed DNA and protein. For the downstream trimming, I need to use a regex pattern match to find the downstream trim location. The problem is my 'endloc' variable is a float and can't be used for slicing the other dataframe columns.
The first block of code below runs a function that returns an end location. No matter what I try, the value ends up as a float in the dataframe. After it are my two unsuccessful attempts to slice the DNA and protein columns. A further problem is that I can't convert the DNA position into a protein position by dividing by 3 here.
def regex_filter(val):
    try:
        return int(3*re.search(endRegex, str(val)).start())
    except Exception:
        pass
#find Vregion protein sequence end location and write new column
df_result['endloc'] = df_result['Vstart_protein'].apply(regex_filter).dropna().astype(int)
#trim Vregion protein sequence and write new column
#df_result['protein'] = df_result.apply(lambda x: x['Vstart_protein'][x:['endloc']/3], 1)
#trim Vregion DNA sequence and write new column
df_result['DNA'] = df_result.Vstart.str[:['endloc']]
How can I get the result of the regex pattern calculation into a dataframe column in a way that I can slice other columns with untrimmed DNA and protein sequence?
After looking at my code again, I came up with the following fixes.
def regex_filter(val):
    try:
        return int(3*re.search(endRegex, str(val)).start())
    except Exception:
        return 1
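The function now returns 1 instead of falling through to None, which pandas stores as NaN. Pandas converts a column of ints to float as soon as it contains a NaN. Because every real match position is multiplied by 3, the value 1 can never be a genuine result, so it works as a marker for "no match".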
#remove rows that didn't have a REGEX match
df_result = df_result[df_result['endloc'] != 1]
df_result = df_result.astype({'endloc': int})
This removes the rows marked with 1 and casts the column to int.
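The float conversion described above can be reproduced on a toy Series. This is only a minimal sketch with made-up values, not the real data. It also shows why the original `.dropna().astype(int)` did not help (the assignment realigns on the index and brings the NaN back), and it shows pandas' nullable `Int64` dtype as a possible alternative that is not part of the fix above.

```python
import pandas as pd

# Made-up values: the second entry stands for a row with no regex match.
raw = pd.Series([12, None, 30])
print(raw.dtype)  # float64, because NaN forces the whole column to float

# Dropping the missing rows first allows a plain integer cast.
as_int = raw.dropna().astype(int)

# Assigning the shorter Series back to a full-length dataframe aligns on
# the index, so the dropped row reappears as NaN and the column is float again.
df = pd.DataFrame({"x": ["a", "b", "c"]})
df["endloc"] = as_int
print(df["endloc"].dtype)  # float64

# Alternative (an assumption, not from the original post): the nullable
# integer dtype keeps the missing row and still stores integers.
nullable = raw.astype("Int64")
print(nullable)
```

With the nullable dtype the unmatched rows stay in the dataframe, so they would still have to be skipped or filtered before slicing.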
There was also a typo in my lambda function. Here is the correct syntax:
#trim Vregion DNA sequence and write new column
df_result['DNA'] = df_result.apply(lambda x: x['Vstart'][:x['endloc']], axis=1)
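Putting the pieces together, here is a small self-contained sketch of the whole trimming step, including the protein slice. The sequences and the `VTVSS` pattern are hypothetical stand-ins for the real reads and `endRegex`, and it drops unmatched rows with `dropna` rather than the sentinel value 1, which is a swapped-in variation. The protein position comes from integer division (`//`), since `/` always returns a float and a float cannot be used as a slice index.

```python
import re
import pandas as pd

# Hypothetical pattern and data, standing in for endRegex and the real reads.
end_regex = re.compile(r"VTVSS")
df_result = pd.DataFrame({
    "Vstart": ["GAAGTTCAGGTTACTGTTTCTTCTGCT", "GAAGTTCAGCTG"],
    "Vstart_protein": ["EVQVTVSSA", "EVQL"],
})

def regex_filter(val):
    """Return the DNA end location (3 x protein index), or None if no match."""
    match = end_regex.search(str(val))
    return 3 * match.start() if match else None

df_result["endloc"] = df_result["Vstart_protein"].apply(regex_filter)

# Drop rows without a match, then cast the column to a plain int.
df_result = df_result.dropna(subset=["endloc"]).astype({"endloc": int})

# Slice row by row; // keeps the protein index an integer.
df_result["protein"] = df_result.apply(
    lambda x: x["Vstart_protein"][: x["endloc"] // 3], axis=1
)
df_result["DNA"] = df_result.apply(lambda x: x["Vstart"][: x["endloc"]], axis=1)

print(df_result[["endloc", "protein", "DNA"]])
```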