Question

How to extract genes from a FASTA file in groups of two (query and object) and store each pair in a separate file

6.7 years ago

bio90029 ▴ 10

Hi, I am running out of ideas for how to do this, and I would appreciate some help, please. I have 2 FASTA files from two different bacterial strains, with 1000 genes each. An example of the files:

file A:                    file B:
    query seq.id               query seq.id
    query seq                  query seq
    objA seq.id                objB seq.id
    objA seq                   objB seq
    query_1 seq.id             query_1 seq.id
    query_1 seq                query_1 seq
    obj_1A seq.id              obj_1B seq.id
    obj_1A seq                 obj_1B seq

What I would like to get is this:

file_1                     file_2
    query seq.id               query_1 seq.id
    query seq                  query_1 seq
    objA seq.id                obj_1A seq.id
    objA seq                   obj_1A seq
    objB seq.id                obj_1B seq.id
    objB seq                   obj_1B seq

But I just don't know how to split the FASTA files. I was trying to do this using Biopython SeqIO, but I am quite lost.

python biopython • 1.4k views

If I understand correctly, you have 2 large .fa files whose individual entries you would like to split into separate folders?

ADD REPLY • link 6.7 years ago by Bioaln ▴ 360

In fact, I have 100 FASTA files that each contain about 1000 genes, but if I manage to do it for 2, I will manage for all of them. Each file contains the query reference genes and the object genes they match. What I would like to do is split the FASTA files, or sort them, so that I have one file per query gene with all the matching object genes.
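One way to get one file per query gene is to walk each input file two records at a time and collect the objects under their query ID. Below is a minimal sketch, assuming that records strictly alternate query, object, that the same query ID appears in every input file, and that the IDs are safe to use as file names. `group_by_query` and `write_groups` are hypothetical helper names, not part of Biopython.

```python
import os


def group_by_query(record_streams):
    """Map each query ID to [query_record, object_record, ...].

    record_streams is an iterable of record iterables, one per input file.
    Records are assumed to alternate strictly: query, object, query, object.
    """
    groups = {}
    for records in record_streams:
        iterator = iter(records)
        # Zipping the same iterator with itself yields consecutive pairs;
        # a trailing unpaired record is silently dropped.
        for query, obj in zip(iterator, iterator):
            # Keep the query once (from the first file), then append objects.
            groups.setdefault(query.id, [query]).append(obj)
    return groups


def write_groups(fasta_paths, out_dir):
    """Write one FASTA file per query ID into out_dir."""
    # Imported here so group_by_query stays usable without Biopython.
    from Bio import SeqIO

    os.makedirs(out_dir, exist_ok=True)
    streams = (SeqIO.parse(path, "fasta") for path in fasta_paths)
    for query_id, records in group_by_query(streams).items():
        out_path = os.path.join(out_dir, "%s.fasta" % query_id)
        with open(out_path, "w") as handle:
            SeqIO.write(records, handle, "fasta")
```

Something like `write_groups(sorted(glob.glob('/path/files/*.fa')), 'by_query')` would then produce one file per query; the output directory name is only an example. All records are held in memory before writing, which should be fine for about 100 files of 1000 genes but may not scale to much larger inputs.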

ADD REPLY • link 6.7 years ago by bio90029 ▴ 10

Please show the Biopython code you're trying, along with any errors, and we can help correct it. This would be the most beneficial for you as a learning experience, and it also keeps someone from writing the code for you, as Biostars is not a coding service.

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

"I was trying to do this using biopython SeqIO but I am quite lost."

Post what you have tried. Also, your use of terms here is a little confusing. Are there '>' characters in your FASTA file headers that are just not shown here? Perhaps post a couple of examples from your files, if you can share them.

ADD REPLY • link 6.7 years ago by st.ph.n ★ 2.7k

Yes, the gene IDs all contain the '>'.

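import glob
import os
from Bio import SeqIO

def sorting_fasta_files():
    # files: list of directories holding the .fa files (defined elsewhere)
    for file in files:
        my_file = glob.glob(file + '/*.fa')
        # print(my_file[1])
        filename = 'outfile%s.fasta'
        records = list(SeqIO.parse(my_file[1], 'fasta'))
        # ID of the first record only, so every output gets the same name
        query_gene = records[0].id
        print(query_gene)
        for record in range(0, len(records), 2):
            with open(os.path.join('/path/files/', filename % query_gene), 'w') as output_handle:
                # Writes the whole list of records, not just the pair at this index
                SeqIO.write(records, output_handle, 'fasta')

sorting_fasta_files()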

But I don't get the right output. In fact, it places all the genes in the new FASTA file, when I only want 2 genes per file.

ADD REPLY • link 6.7 years ago by bio90029 ▴ 10

score 1 · Accepted Answer · 2017-08-25

6.7 years ago

bio90029 ▴ 10

The answer is this little Biopython script. I am posting the link in case someone else needs to do the same as me: http://biopython.org/wiki/Split_large_file
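The linked recipe is built around a generator that hands out records in fixed-size batches. Here is a minimal sketch of the same idea with a batch size of 2, so that each query and object pair goes to its own file. It is a rewrite using `itertools.islice`, not the recipe's exact code, and the `write_pairs` helper and the output file naming are assumptions for illustration.

```python
import os
from itertools import islice


def batch_iterator(records, batch_size=2):
    """Yield lists of up to batch_size consecutive records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def write_pairs(fasta_path, out_dir):
    """Write each (query, object) pair from fasta_path to its own file."""
    # Imported here so batch_iterator stays usable without Biopython.
    from Bio import SeqIO

    os.makedirs(out_dir, exist_ok=True)
    for pair in batch_iterator(SeqIO.parse(fasta_path, "fasta")):
        # Name the file after this pair's query, assumed to come first
        # and to have an ID that is safe to use as a file name.
        query_id = pair[0].id
        out_path = os.path.join(out_dir, "outfile_%s.fasta" % query_id)
        with open(out_path, "w") as output_handle:
            # Write only the current pair, not the full record list.
            SeqIO.write(pair, output_handle, "fasta")
```

Compared with the loop posted earlier in the thread, the two changes that matter are that the file name is taken from each pair's own query ID and that only the current pair is passed to `SeqIO.write`.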
