How to extract genes from a FASTA file in groups of 2 (query and object) and store each pair in a different file
6.7 years ago
bio90029 ▴ 10

Hi, I am running out of ideas for how to do this and would appreciate some help, please. I have 2 FASTA files from two different bacterial strains with 1000 genes each. An example of the files:

    file A:             file B:
    query seq.id        query seq.id
    query seq           query seq
    objA seq.id         objB seq.id
    objA seq            objB seq
    query_1 seq.id      query_1 seq.id
    query_1 seq         query_1 seq
    obj_1A seq.id       obj_1B seq.id
    obj_1A seq          obj_1B seq

What I would like to get is this:

    file_1              file_2
    query seq.id        query_1 seq.id
    query seq           query_1 seq
    objA seq.id         obj_1A seq.id
    objA seq            obj_1A seq
    objB seq.id         obj_1B seq.id
    objB seq            obj_1B seq

But I just don't know how to split the FASTA files. I was trying to do this using Biopython's SeqIO, but I am quite lost.

python biopython • 1.4k views
If I understand correctly, you have 2 large .fa files whose individual entries you would like to split into separate folders?

In fact, I have 100 FASTA files that each contain about 1000 genes, but if I manage to do it for 2, I will manage for all. Each file contains the query (reference) genes and the object genes they match. What I would like to do is split the FASTA files, or sort them, so that I have one file per query gene with all the matching object genes.
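The grouping described here (one output file per query gene, holding the query plus its matching object gene from every input file) could be sketched roughly as below. This is only a sketch under stated assumptions: records in each file strictly alternate query, object, query, object; the same query ID appears in every file; and the query ID is the first whitespace-delimited token of the header. The helpers `read_fasta`, `group_by_query`, and `write_groups` are hypothetical names, and the reader is plain Python rather than Biopython; `SeqIO.parse(path, "fasta")` could take its place.

```python
import os


def read_fasta(path):
    """Yield (header, sequence) tuples; the header excludes the leading '>'."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_query(paths):
    """Map each query ID to [query record, object from file 1, object from file 2, ...]."""
    groups = {}
    for path in paths:
        records = list(read_fasta(path))
        # Assumption: records alternate query, object, query, object, ...
        for i in range(0, len(records) - 1, 2):
            query, obj = records[i], records[i + 1]
            query_id = query[0].split()[0]
            if query_id not in groups:
                groups[query_id] = [query]  # store the query itself only once
            groups[query_id].append(obj)
    return groups


def write_groups(groups, out_dir):
    """Write one FASTA file per query ID into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for query_id, records in groups.items():
        # Assumption: query IDs are safe to use as file names.
        with open(os.path.join(out_dir, query_id + ".fasta"), "w") as out:
            for header, seq in records:
                out.write(">%s\n%s\n" % (header, seq))
```

Calling `write_groups(group_by_query(list_of_fasta_paths), "some_output_dir")` would then produce one file per query; both argument names are placeholders.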

Please show the Biopython code you're trying, with errors, and we can help correct any errors. This would be the most beneficial for you, as a learning experience, and also keep s/o from writing the code for you as Biostars is not a coding service.

"I was trying to do this using biopython SeqIO but I am quite lost."

Post what you have tried. Also, your use of terms here is a little confusing. Are there '>' characters in your FASTA file headers that are just not shown here? Perhaps post a couple of examples from your files, if you can share them.

Yes, the gene IDs all contain the '>'.

import glob
import os
from Bio import SeqIO

def sorting_fasta_files():
    # Assumption: `files` is a list of input directories defined elsewhere.
    for file in files:
        my_file = glob.glob(file + '/*.fa')
        # print(my_file[1])
        filename = 'outfile%s.fasta'
        records = list(SeqIO.parse(my_file[1], 'fasta'))
        query_gene = records[0].id
        print(query_gene)
        for record in range(0, len(records), 2):
            with open(os.path.join('/path/files/', filename % query_gene), 'w') as output_handle:
                SeqIO.write(records, output_handle, 'fasta')

sorting_fasta_files()

But I don't get the right output. In fact, it places all the genes in the new FASTA file, when I only want 2 genes per file.
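A likely reason for that output: inside the loop, `SeqIO.write` is handed the whole `records` list on every iteration, and the output name is always built from the first query ID, so the same file is overwritten each time with every gene. A minimal sketch of the pairing step alone, with plain strings standing in for `SeqRecord` objects (`pair_up` is a hypothetical helper name):

```python
def pair_up(records):
    """Split a list into consecutive (query, object) pairs."""
    return [records[i:i + 2] for i in range(0, len(records), 2)]


records = ["query", "objA", "query_1", "obj_1A"]  # stand-ins for SeqRecord objects
for pair in pair_up(records):
    # With Biopython, each pair could instead be written with
    # SeqIO.write(pair, handle, "fasta"), naming the file after pair[0].id.
    print(pair)
```

The point is that each output file receives only the two-record slice, and the file name comes from that slice's own query rather than from `records[0]`.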

6.7 years ago
bio90029 ▴ 10

The answer is this little Biopython script. I am posting the link in case someone else needs to do the same as me: http://biopython.org/wiki/Split_large_file
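For readers who cannot open the link: the idea on that page is to iterate over the records in fixed-size batches and write each batch to its own file. The following is a minimal sketch of that idea, not a copy of the wiki recipe; it works on any iterable, so with Biopython it could be fed `SeqIO.parse("input.fa", "fasta")` (the file name is a placeholder) and a batch size of 2.

```python
from itertools import islice


def batch_iterator(iterable, batch_size):
    """Yield lists of up to batch_size items; the last batch may be shorter."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in data; each batch would be one (query, object) pair.
for batch in batch_iterator(["query", "objA", "query_1", "obj_1A"], 2):
    print(batch)
```

Because the batches are built lazily, the whole input never has to be held in memory, which matters more for large files than for 1000 genes.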
