Question

how to split a big embl files based on IDs

1

Entering edit mode

6.6 years ago

Mehmet ▴ 820

Dear All,

I have a big embl file that has over 1000 IDs, and I want to split it based on IDs. Each ID should be a separate file containing all information of related ID, and should have the ID name.

How to do that?

sequence gene Assembly • 2.6k views

ADD COMMENT • link updated 6.6 years ago by cpad0112 21k • written 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

Split one embl file into several

ADD REPLY • link 6.6 years ago by lessismore ★ 1.3k

0

Entering edit mode

I know this. What I want is to split big embl file and write output of each ID in a output file based on ID.

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

score 2 · Accepted Answer · 2017-10-26

2

Entering edit mode

6.6 years ago

Alex Reynolds 35k

One option:

$ awk -v RS="ID " -v FS="\n" {'n = split($1, a, ";"); sub(/ /, "_", a[1]); a[1] = a[1]".embl"; outFn = a[1]; record = "ID "$0; printf("%s", record) > outFn;'} ./in.embl

Given test input:

$ more test.embl
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03519 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03520 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
ID AA03521 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//

Run the script:

$ awk -v RS="ID " -v FS="\n" {'n = split($1, a, ";"); sub(/ /, "_", a[1]); a[1] = a[1]".embl"; outFn = a[1]; record = "ID "$0; printf("%s", record) > outFn;'} ./test.embl

This creates four files, one for each record, named by the record's ID:

$ ls -al AA*.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03518_standard.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03519_standard.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03520_standard.embl
-rw-r--r--  1 areynolds  wheel   519 Oct 26 15:01 AA03521_standard.embl

For example:

$ more AA03520_standard.embl 
ID AA03520 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
   aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
   tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120 
   ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
   tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//

If IDs are not uniquely namable, then this could cause previous results to get overwritten. It may be safer to use >> instead of > within the awk statement.

ADD COMMENT • link 6.6 years ago by Alex Reynolds 35k

0

Entering edit mode

the script "$ awk -v RS="ID... is exactly what I need. I don't know how many ID is found in my embl file. So, how can I split each ID using the script?

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

1

Entering edit mode

This awk script will create how ever many files as there are ID lines, at most. See the use of redirection operator > inside the awk script. Each record is written to its own file (so long as the record ID is unique).

If you want to count ahead of time how many ID files you should expect to come out of this script, you could run grep /$ID/ in.embl | wc -l.

The grep statement returns lines that start with ID, and wc -l counts them.

ADD REPLY • link 6.6 years ago by Alex Reynolds 35k

0

Entering edit mode

Thank you very much. I generated, but each embl file stars with _ symbol. how can I generate embl file without starting _ symbol ?

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

It would help to look at your EMBL-formatted file and see what changes are needed to the awk script.

Can you post the first 10 ID lines to your question? You could run something like:

$ grep /$ID/ in.embl | head -10 > firstTenIDs.txt

Then edit your question and add the contents of firstTenIDs.txt to it, so that we can see what you're working with.

ADD REPLY • link 6.6 years ago by Alex Reynolds 35k

0

Entering edit mode

This is one of embl files produced after the script.

ID scaffold00001 scaffold00001; ; ; DNA; ; UNC; 6279 BP. XX AC scaffold00001; XX DE scaffold00001 XX OS . OC . XX FH Key Location/Qualifiers FT CDS join(371..871,936..953) FT /source="EVM" FT /locus_tag="species1_s00001.1" FT CDS join(2234..2326,2407..2652,2880..3080) FT /source="EVM" FT /locus_tag="species1_s00001.2" FT CDS complement(join(5766..5891,5280..5603,4956..5138)) FT /source="EVM" FT /locus_tag="species1_s00001.3" SQ Sequence 6279 BP; 1875 A; 1232 C; 1136 G; 1736 T; 300 other; TATTCGTATt cAGAAGAAGA GCAGCTGAGT AGTCATCAAA GCGAAGAAAT AAAGTTGCCG 60 GCCCAAGTAG TATAGTGGTT AGTATTCCAC GTTGTGGCCG TGGCGACACA GGTTCGATTC 120 CTGTCTTGGG CAAGTGTtCT TTTTtGAGTT ACAgTtCGGG GGgAACTAAC CcAGAaTTGT 180 TTATTGAGGA TGGTCGTATT TtACTAtACT TTTTTtAAAA GACCGCGAAA ATTTATTTTC 240 AAATCTGTTG CAAAATTTTA TGAGATAACT TTTTCCAATT TCTTTtGGGG TTTTCAGCCC 300

List item

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

But I would like to have as this one:

ID scaffold00001; SV 1; linear; unassigned DNA; STD; UNC; 6279 BP. XX FH Key Location/Qualifiers FH FT CDS 371..871 FT CDS 936..953 FT CDS 371..953 FT CDS 2234..2326 FT CDS 2407..2652 FT CDS 2880..3080 FT CDS complement(4956..5138) FT CDS complement(5280..5603) FT CDS complement(5766..5891) FT CDS 2234..3080 FT CDS complement(4956..5891) XX SQ Sequence 6279 BP; 1875 A; 1232 C; 1136 G; 1736 T; 300 other; TATTCGTATt cAGAAGAAGA GCAGCTGAGT AGTCATCAAA GCGAAGAAAT AAAGTTGCCG 60 GCCCAAGTAG TATAGTGGTT AGTATTCCAC GTTGTGGCCG TGGCGACACA GGTTCGATTC 120 CTGTCTTGGG CAAGTGTtCT TTTTtGAGTT ACAgTtCGGG GGgAACTAAC CcAGAaTTGT 180 TTATTGAGGA TGGTCGTATT TtACTAtACT TTTTTtAAAA GACCGCGAAA ATTTATTTTC 240 AAATCTGTTG CAAAATTTTA TGAGATAACT TTTTCCAATT TCTTTtGGGG TTTTCAGCCC 300

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

also before each output file, there are two character blanks. For instance, blank blank contig09311.embl

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

Sorry, I write this as a mistake. Please do not consider this file.

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

how to remove blank before name of each embl file? blank blank contig09311.embl

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

there are two letter space in from of outputted embl file. How can I delete the space?

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

Can you post the first 10 ID lines to your question? You could run something like:

$ grep /$ID/ in.embl | head -10 > firstTenIDs.txt

Then edit your question and add the contents of firstTenIDs.txt to it, so that we can see what you're working with.

You can add four spaces in front of each line to put the text into “code” form.

ADD REPLY • link 6.6 years ago by Alex Reynolds 35k

0

Entering edit mode

I typed this command; grep /$ID/ in.embl | head -10 > firstTenIDs.txt but the output is empty

  contig03585.embl

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

Please post your file somewhere it can downloaded and I’ll take a look.

ADD REPLY • link 6.6 years ago by Alex Reynolds 35k

0

Entering edit mode

in the awk script, there is something that causes blank in from of each outputted embl file.

for instance, assume that xx in front of the .embl file are blank line. I just need to remove blanks in from of the outputted embl file. xxcontig01701.embl

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

0

Entering edit mode

sorry. When I run awk script, each outputted embl file looks like below:

_ contig00436.embl

how to remove _ in front of output files?

ADD REPLY • link 6.6 years ago by Mehmet ▴ 820

score 2 · Accepted Answer · 2017-10-27

@OP: please post first embl record for better parsing.

Note: Before proceeding, create a test directory, copy the original embl (test.embl below) and run the script and check the output for random files.

OP embl ID line:

ID scaffold00001; SV 1; linear; unassigned DNA; STD; UNC; 6279 BP.

based on OP ID line,

$ grep -i "ID" test.embl |  awk '{gsub(";",""); print $2}' |  parallel "awk '/{}/,/\/\//' test.embl > {}.embl"

Above script should create multiple embl files with ID as file name.

Example embl file from post Split one embl file into several (close to OP embl IDs):

$ cat test.embl 
ID   comp0_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE   len=205 path=[1:0-135 1445:136-204]
XX
SQ   Sequence 205 BP; 64 A; 54 C; 31 G; 56 T; 0 other;
     GTATTGAACT GCAGAGCATT AAATGCTGCA ACTCAGTGCT TAGAATTCAT TAGATTCAGA        60
     GCAACGAACC CTAAATACTG AGCTGTCCCA TTAAATACTC TGCAGTTCAA TACTTAGCAT       120
     TCACCATTAA ACATAACACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC       180
     CTGTAGGCTC TCTCCACGGT TACCT                                             205
//
ID   comp0_c0_seq2; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE   len=205 path=[4094:0-135 1445:136-204]
XX
SQ   Sequence 205 BP; 59 A; 50 C; 35 G; 61 T; 0 other;
     AGAGTATTAA ATGTTGCAGT TCAGTGCTTA AAATTTATTG GATTCAGAGA ATCTTCAAAT        60
     TCAACGGACC CTAAACACTG AGCTGTCGCA TTAAATGCTC TGCAGTTCAA TGCTTAGCTT       120
     TCACCATTAA GCATAGCACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC       180
     CTGTAGGCTC TCTCCACGGT TACCT                                             205
//
ID   comp1_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 244 BP.
XX
DE   len=244 path=[3:0-88 875:89-243]
XX
SQ   Sequence 244 BP; 71 A; 51 C; 63 G; 59 T; 0 other;
     GCAGAATTTA AGGCTATGAA TCAGGAGGTT CATAATTCCT TAAGGAGGGG AGTATGATGC        60
     GGAGCATCCA CGCTCACCTC CACTCCACCG CATTGTCTTC GAGCTGTGAC AGCCAGCGCA       120
     TAATATTCAA GAGCTATTGA CAGGTGTTGA AACGCGGCAG CCTTGCATAC TATTGAAGGA       180
     CCACGTTTCA TTATTGTGAT CTATAAGAAG ACAGCTGATG CGATCATGAG GAAGGAAGAA       240
     GGCT                                                                    244
//

output:

$ ls *.embl
comp0_c0_seq1.embl  comp0_c0_seq2.embl  comp1_c0_seq1.embl  test.embl

.

$ more comp0_c0_seq1.embl 
ID   comp0_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 205 BP.
XX
DE   len=205 path=[1:0-135 1445:136-204]
XX
SQ   Sequence 205 BP; 64 A; 54 C; 31 G; 56 T; 0 other;
     GTATTGAACT GCAGAGCATT AAATGCTGCA ACTCAGTGCT TAGAATTCAT TAGATTCAGA        60
     GCAACGAACC CTAAATACTG AGCTGTCCCA TTAAATACTC TGCAGTTCAA TACTTAGCAT       120
     TCACCATTAA ACATAACACT TCCCGAGTTT CCACCATCCA TAAACAGCAG GCATTGTAAC       180
     CTGTAGGCTC TCTCCACGGT TACCT                                             205
//
$ more comp1_c0_seq1.embl 
ID   comp1_c0_seq1; SV 1; linear; unassigned DNA; STD; UNC; 244 BP.
XX
DE   len=244 path=[3:0-88 875:89-243]
XX
SQ   Sequence 244 BP; 71 A; 51 C; 63 G; 59 T; 0 other;
     GCAGAATTTA AGGCTATGAA TCAGGAGGTT CATAATTCCT TAAGGAGGGG AGTATGATGC        60
     GGAGCATCCA CGCTCACCTC CACTCCACCG CATTGTCTTC GAGCTGTGAC AGCCAGCGCA       120
     TAATATTCAA GAGCTATTGA CAGGTGTTGA AACGCGGCAG CCTTGCATAC TATTGAAGGA       180
     CCACGTTTCA TTATTGTGAT CTATAAGAAG ACAGCTGATG CGATCATGAG GAAGGAAGAA       240
     GGCT                                                                    244
//

List item