Question

How to get Python code to run correctly for hetero-dimerization analysis

0

Entering edit mode

3.0 years ago

Apex92 ▴ 280

I have a code that gives output in a format I want however the results it shows are not 100% correct.

I have two big fast files which look like below (information after # are not in the actual file - I put them i this question for better understanding):

f1.fa

>seq1
ATCGCGT  #--> 6mers are ATCGCG and TCGCGT // reverse complements are CGCGAT and ACGCGA
>seq20
CCATGAA  #--> 6mers are CCATGA and CATGAA // reverse complements are TCATGG and TTCATG
>seq3
AACATGA  #--> 6mers are AACATG and ACATGA // reverse complements are CATGTT and TCATGT

f2.fa

>seq9
NNCGCGATNNCGCGATNN
>seq85
NNTTCATG

what I want to do is:

to read 6mers of reach sequence from file1
make reverse complements of 6mers
look for any sequence (these are target sequences) in file2 to see if any of the reverse complement 6mers are inside then mark those sequences against each other with 1 else with 0
report the result as a table.

so based on the example sequences above - one of the reverse complement 6mers from seq1 (CGCGAT) is inside seq9 of file2 - also one of the reverse complement 6mers from seq20 (TTCATG) is inside seq85 of file2.

so the desired output should be as below:

         seq1  seq20  seq3
name                    
seq9      1      0     0
seq85     0      1     0

but currently, i get:

         seq1  seq20  seq3
name                    
seq9      0      0     0
seq85     0      1     0

My code is:

from Bio import SeqIO
import pandas as pd


def same_seq(a_record, brecord):
    window_size = 7
    for j in range(0, len(a_record.seq) - window_size + 1):
        for k in (a_record.seq.reverse_complement()[j: j + 6].split()):
            return brecord.seq.find(k) != -1


if __name__ == '__main__':
    records = list(SeqIO.parse("f1.fa", "fasta"))
    target_records = list(SeqIO.parse("f2.fa", "fasta"))
    rows_list = []
    for target_record in target_records:
        new_row = {'name': target_record.name}
        for record in records:
            if same_seq(record, target_record):
                new_row[record.name] = 1
            else:
                new_row[record.name] = 0
        rows_list.append(new_row)
    df = pd.DataFrame(rows_list)
    df = df.set_index(["name"])
    print(df)

I am not sure if the first part or the second part of the code is causing the problem. It would be great if you could help me to fix my code. Thank you

python sequence programming FASTA • 816 views

ADD COMMENT • link updated 3.0 years ago by cpad0112 21k • written 3.0 years ago by Apex92 ▴ 280

score 1 · Answer 1 · 2021-04-10

code:

#! /usr/bin/env python3
import os
import numpy as np
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq


records = list(SeqIO.parse("file4.fa", "fasta"))
target_records = list(SeqIO.parse("file3.fa", "fasta"))

window_size = 6
step_size = 1

match_seqs = []

for i in records:
    for j in range(0, len(i.seq)-window_size+1):
        for k in (i.seq.reverse_complement()[j: j+window_size].split()):
            for l in target_records:
                match_seqs.append ([i.id, l.id, 1] if l.seq.find(k) != -1 else [i.id, l.id, 0])

df = (pd.DataFrame(match_seqs, columns=['file1', 'file2', "match"])
      .groupby(['file1', 'file2']).max('match').reset_index()
      .pivot(index="file1", columns="file2", values="match"))
print(df)

output:

file2  seq1  seq20  seq3
file1                   
seq85     0      1     0
seq9      1      0     0

Note. Be mindful of tabs in copy/pasting the code.