How to get Python code to run correctly for hetero-dimerization analysis
0
0
Entering edit mode
3.0 years ago
Apex92 ▴ 280

I have a code that gives output in a format I want however the results it shows are not 100% correct.

I have two big fast files which look like below (information after # are not in the actual file - I put them i this question for better understanding):

f1.fa

>seq1
ATCGCGT  #--> 6mers are ATCGCG and TCGCGT // reverse complements are CGCGAT and ACGCGA
>seq20
CCATGAA  #--> 6mers are CCATGA and CATGAA // reverse complements are TCATGG and TTCATG
>seq3
AACATGA  #--> 6mers are AACATG and ACATGA // reverse complements are CATGTT and TCATGT

f2.fa

>seq9
NNCGCGATNNCGCGATNN
>seq85
NNTTCATG

what I want to do is:

  1. to read 6mers of reach sequence from file1
  2. make reverse complements of 6mers
  3. look for any sequence (these are target sequences) in file2 to see if any of the reverse complement 6mers are inside then mark those sequences against each other with 1 else with 0
  4. report the result as a table.

so based on the example sequences above - one of the reverse complement 6mers from seq1 (CGCGAT) is inside seq9 of file2 - also one of the reverse complement 6mers from seq20 (TTCATG) is inside seq85 of file2.

so the desired output should be as below:

         seq1  seq20  seq3
name                    
seq9      1      0     0
seq85     0      1     0

but currently, i get:

         seq1  seq20  seq3
name                    
seq9      0      0     0
seq85     0      1     0

My code is:

from Bio import SeqIO
import pandas as pd


def same_seq(a_record, brecord):
    window_size = 7
    for j in range(0, len(a_record.seq) - window_size + 1):
        for k in (a_record.seq.reverse_complement()[j: j + 6].split()):
            return brecord.seq.find(k) != -1


if __name__ == '__main__':
    records = list(SeqIO.parse("f1.fa", "fasta"))
    target_records = list(SeqIO.parse("f2.fa", "fasta"))
    rows_list = []
    for target_record in target_records:
        new_row = {'name': target_record.name}
        for record in records:
            if same_seq(record, target_record):
                new_row[record.name] = 1
            else:
                new_row[record.name] = 0
        rows_list.append(new_row)
    df = pd.DataFrame(rows_list)
    df = df.set_index(["name"])
    print(df)

I am not sure if the first part or the second part of the code is causing the problem. It would be great if you could help me to fix my code. Thank you

python sequence programming FASTA • 816 views
ADD COMMENT
1
Entering edit mode
3.0 years ago

code:

#! /usr/bin/env python3
import os
import numpy as np
import pandas as pd
from Bio import SeqIO
from Bio.Seq import Seq


records = list(SeqIO.parse("file4.fa", "fasta"))
target_records = list(SeqIO.parse("file3.fa", "fasta"))

window_size = 6
step_size = 1

match_seqs = []

for i in records:
    for j in range(0, len(i.seq)-window_size+1):
        for k in (i.seq.reverse_complement()[j: j+window_size].split()):
            for l in target_records:
                match_seqs.append ([i.id, l.id, 1] if l.seq.find(k) != -1 else [i.id, l.id, 0])

df = (pd.DataFrame(match_seqs, columns=['file1', 'file2', "match"])
      .groupby(['file1', 'file2']).max('match').reset_index()
      .pivot(index="file1", columns="file2", values="match"))
print(df)

output:

file2  seq1  seq20  seq3
file1                   
seq85     0      1     0
seq9      1      0     0

Note. Be mindful of tabs in copy/pasting the code.

ADD COMMENT
0
Entering edit mode

Thank you very much :)

ADD REPLY

Login before adding your answer.

Traffic: 2457 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6