Question

Parsing dataframe and sort a fasta file

0

Entering edit mode

5.9 years ago

Chvatil ▴ 130

Hi all, I need some help. In fact in the contexte of my work I have a dataframe of seq names paired together, in each row there is one column for the seq_id of the sp1 and one fore the seq _idof the sp2. In another hand I have two fasta files which contain all these sequences (same seq Id) + the sequences in fasta format. But in these files sequences are totaly mixed and what I need to do is to reorganize two new fasta file by parsing my dataframe and say, ok for each row, put the seqx_A in fasta file 1 and seqx_B in fasta file 2. By keeping the order in the dataframe. Here is an exemple:

I actually have one dataframe with sequences in order such :

Seq_1.id    Seq_2.id
seq1_A     seq8_B
seq2_A     Seq9_B
seq3_A     Seq10_B
seq4_A     Seq11_B

and two fasta files such : first one:

>Seq11_B
ACTG
>seq8_B
ATGC
>seq3_A
ACTG
>seq2_A
ATGC

second one: 
>seq4_A
 ACTG
>seq1_A
 ACTG
>Seq10_B
 ATGC
>Seq9_B
 ATCG

As you can see _A and _B are mixed in bot fasta file but I would like to order my fastafiles by creating a new ones and put all seq A in a file and all seqB in another file in the same order as in the dataframe (paires sequence as always to be added in the same time in the file). here would be the output of the exemple:

fasta1:

>seq1_A
ATGC
>seq2_A 
ATGG
>seq3_A 
ATGC
>seq4_A 
ATGC

and fasta2:

>seq8_B
ATGc
>Seq9_B
ATGC
>Seq10_B
ATGC
>Seq11_B
ATGC

Here would be the name of the files:

candidate_df.read_csv("dn_ds.out_test",sep='\t')

#--------------------------------------
#Load the sequences comming from the cluster filtering and range them into ordered files per species

#here is the two columns of the dataframe
seq1_id=candidate_df["seq1_id"]
seq2_id=candidate_df["seq2_id"]

#Here is the output desired files:
output_aa_sp1 = open('candidates_aa_0042.fasta','w')
output_aa_sp2 = open('candidates_aa_0035.fasta','w')


#Here are the 2 fasta file to be modified
record_dict_sp1_aa = SeqIO.to_dict(SeqIO.parse("result1_aa.fasta", "fasta"))
record_dict_sp2_aa = SeqIO.to_dict(SeqIO.parse("result2_aa.fasta", "fasta"))

Does someone have an idea?

Thank you :)

python pandas fasta • 9.2k views

ADD COMMENT • link updated 5.9 years ago by Bastien Hervé 5.3k • written 5.9 years ago by Chvatil ▴ 130

2

Entering edit mode

5.9 years ago

shenwei356 8.4k

You've finished half the code, next step is just iterating the seq1_id and check if one id exists in any of record_dict_sp1_aa or record_dict_sp2_aa and printing the result.

Actually, there's no need to write scripts.

# concat two files

cat f1.fasta f2.fasta > all.fasta

# export ids for seq1 and seq2

cut -f 1 id.tsv | sed 1d > seq1_id.txt
cut -f 2 id.tsv | sed 1d > seq2_id.txt

# extract seqs by every id in ids file, using samtools faidx

cat seq1_id.txt  | while read id; do samtools faidx all.fasta $id; done > result1.fasta
cat seq2_id.txt  | while read id; do samtools faidx all.fasta $id; done > result2.fasta

# verify

$ grep '>' result1.fasta 
>seq1_A
>seq2_A
>seq3_A
>seq4_A

$ grep '>' result2.fasta 
>seq8_B
>Seq9_B
>Seq10_B
>Seq11_B

ADD COMMENT • link 5.9 years ago by shenwei356 8.4k

0

Entering edit mode

Hi, thank you, I think it will be usefull for other peoples but I actually need to code it with python because all my pipeline is coded with it..

ADD REPLY • link 5.9 years ago by Chvatil ▴ 130

0

Entering edit mode

I've just updated the answer.

You've finished half the code, next step is just iterating the seq1_id and check if one id exists in any of record_dict_sp1_aa or record_dict_sp2_aa and printing the result.

ADD REPLY • link 5.9 years ago by shenwei356 8.4k

0

Entering edit mode

I wrote an answer to show you what I tried

ADD REPLY • link 5.9 years ago by Chvatil ▴ 130

0

Entering edit mode

5.9 years ago

Chvatil ▴ 130

Hi tried this :

for a, b in zip(seq1_id,seq2_id): 
    if a in record_dict_sp1_aa:

        print(">",record_dict_sp1_aa[a].id,file=output_aa_sp1,sep="")
        print(record_dict_sp1_aa[a].seq,file=output_aa_sp1)

        print(">",record_dict_sp2_aa[b].id,file=output_aa_sp2,sep="")
        print(record_dict_sp2_aa[b].seq,file=output_aa_sp2)

    else:

        print(">",record_dict_sp2_aa[a].id,file=output_aa_sp1,sep="")
        print(record_dict_sp2_aa[a].seq,file=output_aa_sp1)

        print(">",record_dict_sp1_aa[b].id,file=output_aa_sp2,sep="")
        print(record_dict_sp1_aa[b].seq,file=output_aa_sp2)

But in my first dataframa I have 196 rows, so there should be 196 seq in the fasta1 and 196 in the fasta2 but I actually get 191 in fasta1 and 184 in fasta2.. And I check theses sequences are well present in the two first fasta files...

ADD COMMENT • link 5.9 years ago by Chvatil ▴ 130

0

Entering edit mode

Bad logic. You need to iterate seq1_id and seq2_id separately.

ADD REPLY • link 5.9 years ago by shenwei356 8.4k

1

Entering edit mode

for id in seq1_id:
    if id in record_dict_sp1_aa:
        # output record_dict_sp1_aa[id]
        pass
    elif id in record_dict_sp2_aa:
        # output record_dict_sp2_aa[id]
        pass
    else:
        # reporting a id not in any files

for id in seq2_id:
    pass

ADD REPLY • link 5.9 years ago by shenwei356 8.4k

score 3 · Accepted Answer · 2018-05-18

3

Entering edit mode

5.9 years ago

Bastien Hervé 5.3k

Biopython is powerful !

import pandas as pd
from Bio import SeqIO

output_handle1 = open("new_fasta1.fasta", "a")
output_handle2 = open("new_fasta2.fasta", "a")

records1 = SeqIO.index("fasta1.fasta", "fasta")
records2 = SeqIO.index("fasta2.fasta", "fasta")

candidate_df=pd.read_csv("dispatch.csv",sep='\t')

for i in candidate_df['Seq_1.id']:
    if i in records1:
        SeqIO.write(records1[i], output_handle1, 'fasta')
    elif i in records2:
        SeqIO.write(records2[i], output_handle1, 'fasta')

for i in candidate_df['Seq_2.id']:
    if i in records1:
        SeqIO.write(records1[i], output_handle2, 'fasta')
    elif i in records2:
        SeqIO.write(records2[i], output_handle2, 'fasta')

ADD COMMENT • link 5.9 years ago by Bastien Hervé 5.3k

0

Entering edit mode

Hi think it is working, do you know how can I count how many record I have in a fasta file with biopython?

ADD REPLY • link 5.9 years ago by Chvatil ▴ 130

0

Entering edit mode

You mean in new_fasta1 and new_fasta2 ?

Put a count for each file

Or in general ?

records = SeqIO.index("your_file.fasta", "fasta")
print(len(records))

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0

Entering edit mode

BioPython is powerful but slow and, perhaps, overkill for this task.

ADD REPLY • link 5.9 years ago by Alex Reynolds 35k

0

Entering edit mode

Indeed, if time really matter to him, he should swap to high level language as C. But he mentions python so I gave him this solution. He also could have write something in python from scratch like you did but i'm not even sure that would be quicker. I know, by testing, that Biopython is memory consuming. I'm just curious, do you have a paper on hand on this biopython downside ? Thanks !

ADD REPLY • link 5.9 years ago by Bastien Hervé 5.3k

0

Entering edit mode

I do not have a paper on this, I'm sorry. I'm just relating past (anecdotal) experience processing FASTA files with this library.

ADD REPLY • link 5.9 years ago by Alex Reynolds 35k

score 1 · Accepted Answer · 2018-05-18

If you must use Python, it's a two-step procedure. First, write the headers into ordered lists ("arrays", but really lists) and then use a hash table/dictionary to lookup sequences from the headers in the ordered lists.

First, read the columns of the first file into a list:

seq1 = []
seq2 = []
with open("ordered_sequences.txt", "r") as osh:
  # skip header
  osh.readline() 
  for line in osh:
    elems = line.rstrip().split('\t')
    seq1.append(elems[0])
    seq2.append(elems[1])

Once you have built seq1 and seq2, read each FASTA file separately into a hash table ("dictionary" in Python-speak).

For example, for the first FASTA file first.fa:

records = {}
sequence = ""
with open("first.fa", "r") as fh:
  for line in fh:
    line = line.rstrip()
    if line.startswith('>'):
      if len(sequence) > 0:
        records[header] = sequence
        sequence = ""
      header = line[1:]
    else
      sequence += line
# read last record into dictionary
records[header] = sequence

(Repeat for the second FASTA file, to a separate hash table or run instance.)

Then write out records from the hash table records in order, using the headers in the ordered lists.

For example, using headers in seq1:

for seq in seq1:
  sys.stdout.write(">%s\n%s\n" % (seq, records[seq]))

Or for seq2, say:

for seq in seq2:
  sys.stdout.write(">%s\n%s\n" % (seq, records[seq]))

Use the redirection operator > when running the Python script to write standard output to a file:

$ python someScript.py > orderedOutput.fa

Or use open and write to a writable file handle.