Question

Separate Files

0

10.3 years ago

wangjununo • 0

I have some biological sequence data like the following:

>b_comp_seq1
ACGCGGGGGAATTT
>b_comp_seq_2
ACGGGCTTTCACC
.....
>b_comp_seq_64
ACCCGGGAATT

I want to extract these sequences into separate named files, 4 sequences per file. That is, I have 64 sequences and want to split them into 16 files, each holding 4 sequences and each with a different name. Is there a Perl script or some other way to do this? Thank you.

perl • 3.0k views

updated 6.5 years ago by Biostar 20 • written 10.3 years ago by wangjununo • 0

1

Why don't you try "split"? Each entry takes 2 lines (1 header and 1 sequence line), so you have a total of 64 x 2 = 128 lines. Now try this command on UNIX:

split --lines 8 Original_file (It should give you 16 files)

10.3 years ago by Ashutosh Pandey 12k
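
For anyone who wants to sanity-check the line arithmetic before touching real data, here is a minimal sketch. It builds a made-up 64-record test file (demo.fasta, the part_ prefix, and the dummy ACGT sequences are all assumptions for illustration) and only holds if every sequence sits on a single line:

```shell
#!/bin/sh
# Sketch: split a 2-lines-per-record FASTA file into chunks of 4 records.
# demo.fasta and the "part_" prefix are hypothetical names.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Build 64 dummy records: one header line plus one sequence line each.
i=1
while [ "$i" -le 64 ]; do
    printf '>b_comp_seq_%d\nACGT\n' "$i"
    i=$((i + 1))
done > demo.fasta

# 4 records x 2 lines = 8 lines per output file.
split -l 8 demo.fasta part_

# Count the pieces, then count the headers in the first piece.
ls part_* | wc -l
grep -c '^>' part_aa
```

The pieces get split's default names (part_aa, part_ab, ...), and the headers are kept in each file.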

0

Unless I'm mistaken, that will not work. It appears that the OP wants only the four extracted sequences in each of the 16 files, with no headers.

10.3 years ago by Kenosis ★ 1.3k

1

In that case he can make a new file using the following command: grep -v "^>" Original_file > Newfile. This file will contain only the sequences. He will then have to run split on Newfile with --lines 4 instead of 8.

10.3 years ago by Ashutosh Pandey 12k
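
The two steps can also be chained without the intermediate file. The sketch below is one way to do it, assuming GNU split (the -d and --additional-suffix options are GNU extensions) and single-line sequences; demo.fasta and the seqs_ prefix are made-up names. The numeric suffixes give each file a distinct, predictable name:

```shell
#!/bin/sh
# Sketch: headerless groups of 4 sequences with numbered file names.
# Assumes GNU split; demo.fasta and "seqs_" are hypothetical names.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Build 64 dummy records: one header line plus one sequence line each.
i=1
while [ "$i" -le 64 ]; do
    printf '>b_comp_seq_%d\nACGT\n' "$i"
    i=$((i + 1))
done > demo.fasta

# Drop the header lines, then cut the 64 remaining lines into groups of 4.
# -d asks for numeric suffixes; --additional-suffix appends an extension.
grep -v '^>' demo.fasta | split -l 4 -d --additional-suffix=.txt - seqs_

# Count the pieces and show the first one.
ls seqs_*.txt | wc -l
cat seqs_00.txt
```

This writes seqs_00.txt through seqs_15.txt, each with 4 sequence lines and no headers.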

0

This is a nice solution.

10.3 years ago by Kenosis ★ 1.3k

0

There's probably not one floating around anywhere, but you could trivially write one.

10.3 years ago by Devon Ryan 104k

0

Perhaps not so 'trivial' for the OP...

10.3 years ago by Kenosis ★ 1.3k

1

Then the OP should find a different field. Those who can't program at least a little have no business in bioinformatics.

10.3 years ago by Devon Ryan 104k

0

I don't know what "a little" means, in this context--either operationally or stipulatively. I also didn't know that bioinformatics requires a programming background for admission. If not, however, perhaps the OP is just now nurturing his/her emerging programming skills, since a certain level of proficiency ("a little"?) is required at some point in the OP's matriculation.

10.3 years ago by Kenosis ★ 1.3k

1

Well, dpryan79 is not entirely wrong. He has helped a lot of beginners by answering their questions. But lately we have been getting many trivial questions. People don't try enough before posting their questions to the forum. The best thing would be to also post whatever you have tried so far along with the real question. This will show that the user has made a sincere effort to resolve the problem.

10.3 years ago by Ashutosh Pandey 12k

0

Your point is well made, and I didn't mean to suggest that dpryan79 was wrong. Even in my solution I mentioned that it would be nice to see some problem-solving attempts. I sometimes have a hard time with the use of the word 'trivial' in programming contexts, since I know that what would be 'trivial' to some programmers would make my brain hurt.

10.3 years ago by Kenosis ★ 1.3k

score 0 · Answer 1 · 2013-12-31

0

10.3 years ago

Pavel Senin ★ 1.9k

I think this will work for you c-code

score 0 · Answer 2 · 2014-01-01

It would be good to see your solution attempts. Having said that, the following will do what you need:

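use strict;
use warnings;
use autodie;

my ( $fh, $n );

# Assumes two lines per record: one header line, then one sequence line.
while (<>) {
    # Every 8 input lines (4 records), open the next output file:
    # file1.txt, file2.txt, ... Reopening $fh closes the previous file.
    open $fh, '>', 'file' . ++$n . '.txt' unless ( $. - 1 ) % 8;

    # Write only the sequence lines; skip the ">" header lines.
    print $fh $_ unless /^>/;
}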

Usage: perl script.pl inFile

And as a one-liner:

perl -ne 'open $fh, ">", "file" . ++$n . ".txt" unless ( $. - 1 ) % 8; print $fh $_ unless /^>/' inFile
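
Note that the script above counts lines, so it relies on each sequence occupying exactly one line. If the sequences are wrapped over several lines, one alternative is to count header lines instead. The following is a rough sketch of that idea in awk rather than Perl (so not the script above, just a variant of the same approach); wrapped.fasta, the group_NN.txt names, and the per variable are made up for illustration:

```shell
#!/bin/sh
# Sketch: 4 sequences per file, no headers, tolerating wrapped sequences.
# wrapped.fasta and the group_NN.txt names are hypothetical.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Test input: 8 records whose sequences are wrapped over two lines.
i=1
while [ "$i" -le 8 ]; do
    printf '>b_comp_seq_%d\nACGTACGT\nTTGA\n' "$i"
    i=$((i + 1))
done > wrapped.fasta

awk -v per=4 '
    /^>/ {
        if (n > 0) printf("\n") > out      # end the previous sequence line
        if (n % per == 0) {                # every per-th record: next file
            if (out != "") close(out)
            out = sprintf("group_%02d.txt", n / per + 1)
        }
        n++
        next                               # headers are not written
    }
    { printf("%s", $0) > out }             # append wrapped lines unbroken
    END { if (n > 0) printf("\n") > out }
' wrapped.fasta

wc -l group_*.txt
```

Each output file ends up with one line per sequence, with the wrapped pieces joined back together. The sketch assumes the input starts with a header line.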

score 0 · Answer 3 · 2014-01-01

0

10.3 years ago

Pierre Lindenbaum 161k

Linearize the FASTA records with awk (so that each sequence sits on a single line), then use the Linux command 'split':

$ awk '/^>/ {printf("%s%s\n",(N==0?"":"\n"),$0); ++N; next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |\
split -l 8 - FASTA.
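
To see what this produces, here is a small check on made-up data. It runs the same pipeline on a hypothetical 8-record input.fasta whose sequences are wrapped over two lines (the record names and sequences are invented for the test), then counts the headers in each piece:

```shell
#!/bin/sh
# Sketch: run the awk + split pipeline on a small wrapped FASTA file.
# The contents of input.fasta are invented test data.
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# 8 records, each sequence wrapped over two lines.
i=1
while [ "$i" -le 8 ]; do
    printf '>b_comp_seq_%d\nACGTACGT\nTTGA\n' "$i"
    i=$((i + 1))
done > input.fasta

# Join each record's sequence onto one line, then split every 8 lines
# (4 header + sequence pairs per output file).
awk '/^>/ {printf("%s%s\n",(N==0?"":"\n"),$0); ++N; next;} {printf("%s",$0);} END {printf("\n");}' input.fasta |
    split -l 8 - FASTA.

# Report the number of headers in each piece.
grep -c '^>' FASTA.*
```

Unlike the header-stripping variants, this keeps the header lines, so each piece (FASTA.aa, FASTA.ab, ...) is itself a valid FASTA file.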