Question

Using Perl Text::Wrap For Wrapping Text

2

10.5 years ago

The ▴ 180

Hi all, I'm using Perl's Text::Wrap module to break a long DNA string into lines of 60 columns. The code is as follows. Although the results look right, I'm a bit sceptical about whether I'm doing it correctly, as I haven't used this module before.

use Text::Wrap;
$Text::Wrap::columns = 60;

my $str_60 =Text::Wrap::fill( '', '', join '', uc($longdna_string) );
print $str_60;

perl text • 6.7k views

ADD COMMENT • link updated 10.5 years ago by SES 8.6k • written 10.5 years ago by The ▴ 180

score 4 · Answer 1 · 2013-10-25

4

10.5 years ago

SES 8.6k

There may be a reason to use this module, for example to make your code more portable. Still, it seems a little silly to use a module for such a simple task, so I wanted to provide another solution. Here is a simple one (borrowing from Kenosis's answer) that does not use any modules.

#!/usr/bin/env perl

use strict;
use warnings;

my $longdna_string = <<END;
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
END

$longdna_string =~ s/(.{60})/$1\n/gs;

print $longdna_string;

This will produce the wrapped output:

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCG
CTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGG
CAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCC
AGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAG
GCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGA
AGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACA
GACCTGAA
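
One edge case worth noting: the heredoc leaves a trailing newline on the string, and if the sequence length happens to be an exact multiple of 60, the substitution also adds a newline after the last chunk, so the output would end with a blank line. Below is a minimal sketch of one way around this, assuming any whitespace in the input should simply be discarded. The helper name wrap_seq is hypothetical and the test sequence is made up for illustration.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Hypothetical helper: split a sequence into fixed-width lines.
sub wrap_seq {
    my ($seq, $width) = @_;
    $width ||= 60;
    $seq =~ s/\s+//g;    # drop newlines and any other whitespace first
    # A repeated unpack group yields chunks of up to $width characters.
    return join "\n", unpack "(A${width})*", $seq;
}

my $seq = ('ACGT' x 30) . "\n";    # 120 bases (a multiple of 60) plus a newline
print wrap_seq($seq, 60), "\n";    # two 60-column lines, no trailing blank line
```

Because the newline is added only between chunks, the caller decides whether to print a final one.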

ADD COMMENT • link 10.5 years ago by SES 8.6k

1

Yours is a nice solution! However, consider the following minor modifications:

$longdna_string =~ s/.{60}\K/\n/g;

print uc $longdna_string;

Instead of substituting entire strings, the \K ("keep") assertion (Perl v5.10+) lets the substitution insert only a newline after every 60th character. The result is then uppercased with uc, as the OP did with the original string.
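
As a small sketch of the difference, the two substitutions can be run side by side on the same input to confirm they give the same result. The test sequence and the variable names $captured and $kept are made up for illustration.

```perl
use strict;
use warnings;
use 5.010;    # \K requires Perl 5.10 or later

my $seq = 'acgt' x 40;    # 160 lowercase bases, made up for illustration

# Capture-and-reinsert: each 60-character chunk is replaced by itself plus "\n".
(my $captured = $seq) =~ s/(.{60})/$1\n/g;

# \K: the 60 characters are matched but kept, so only "\n" is inserted.
(my $kept = $seq) =~ s/.{60}\K/\n/g;

print $captured eq $kept ? "identical\n" : "different\n";
print uc($kept), "\n";    # uppercase after wrapping, as in the reply above
```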

ADD REPLY • link 10.5 years ago by Kenosis ★ 1.3k

0

+1 Thanks for the suggestion, that is a more elegant solution. By the way, why would you want to uc the string after the substitution? It's not clear to me how it would get modified.

ADD REPLY • link 10.5 years ago by SES 8.6k

0

You're most welcome!

It seems the OP wanted to ensure an uppercase sequence, so a uc'ed string was passed as a parameter to Text::Wrap. The print uc $longdna_string notation just uppercases the string after the substitution, yet produces the same final result.

We can see how Perl parses it by executing the following at the command line:

perl -MO=Deparse,-p -e 'print uc $longdna_string'

Output:

print(uc($longdna_string));
-e syntax OK

You'll note the nesting, where the results of uc are passed to print.

ADD REPLY • link 10.5 years ago by Kenosis ★ 1.3k

0

I understand what it does, I was curious why you think it's necessary. In other words, why uppercase an uppercase string? This would be a more general solution but I can't figure out why it's necessary here. It would seem pretty unsettling if Perl was randomly changing case because, of course, that is very important in many different contexts.

ADD REPLY • link 10.5 years ago by SES 8.6k

0

Ah. My apologies for misunderstanding your question.

The only reason I used uc was that the OP did, presumably for good reason. Certainly, if the original sequence is already all uppercase, uc isn't necessary.

ADD REPLY • link 10.5 years ago by Kenosis ★ 1.3k

1

FWIW, historically nucleotide sequences have been represented using lower case letters and protein sequences using upper case. This provides a hint about the sequence type and helps avoid handling a sequence as the wrong type. The convention can be seen in the major databases:

DDBJ: DNA, lower case (e.g. http://getentry.ddbj.nig.ac.jp/getentry/na/L12345/?format=flatfile&filetype=html)
DAD: protein, upper case (e.g. http://getentry.ddbj.nig.ac.jp/getentry/dad/AAC69641.1/?filetype=html)
GenBank: DNA, lower case (e.g. http://www.ncbi.nlm.nih.gov/nuccore/l12345)
GenPept: protein, upper case (e.g. http://www.ncbi.nlm.nih.gov/protein/1228925)
ENA EMBL-Bank: DNA, lower case (e.g. http://www.ebi.ac.uk/ena/data/view/L12345&display=text)
UniProtKB: protein, upper case (e.g. http://www.uniprot.org/uniprot/Q29042.txt)

ADD REPLY • link 10.5 years ago by Hamish ★ 3.2k

0

Thank you, Hamish, for providing an excellent context (with references) for the OP's original use of uc.

ADD REPLY • link 10.5 years ago by Kenosis ★ 1.3k

Ram · Answer 2 · 2013-10-25

2

10.5 years ago

Kenosis ★ 1.3k

You can just do the following to wrap the long DNA string at 60 columns using Text::Wrap:

use strict;
use warnings;
use Text::Wrap;

$Text::Wrap::columns = 61;

my $longdna_string = <<END;
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
END

print wrap('', '', uc $longdna_string);

Output:

ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCG
CTGCCCTGCCCCTGGAGGGTGGCCCCACCGGCCGAGACAGCGAGCATATGCAGGAAGCGG
CAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCC
AGGCCAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAG
GCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAATGCCCTGCAGGAACTTCTTCTGGA
AGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACA
GACCTGAA

Hope this helps!

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.5 years ago by Kenosis ★ 1.3k

0

Strangely, this does not produce the correct result. Those lines are 59 columns (not 60), but I can't say why that would be. That may be a bug in the module or something funny with the input (though I copied and pasted your code and it worked for me).

ADD REPLY • link 10.5 years ago by SES 8.6k

0

You're quite correct. Excellent eye! This is my error, as the module's documentation clearly says, "...every resulting line will have length of no more than $columns - 1 ." Thus, the wrap value in the case above should have been columns+1 (61, not the original 60), and this has been corrected.
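
A quick way to see the off-by-one for yourself is to wrap the same string with both settings and compare the first line's length. This is only a sketch; the test sequence is made up for illustration, and it assumes Text::Wrap's default behaviour of breaking over-long words.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

my $seq = 'ACGT' x 40;    # 160 bases, made up for illustration

for my $columns (60, 61) {
    # Lines come out at most $columns - 1 characters wide.
    local $Text::Wrap::columns = $columns;
    my @lines = split /\n/, wrap('', '', $seq);
    printf "columns=%d -> first line is %d characters\n",
        $columns, length $lines[0];
}
```

Per the documentation quoted above, this should report 59 characters for columns=60 and 60 characters for columns=61.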

ADD REPLY • link 10.5 years ago by Kenosis ★ 1.3k

0

Interesting, I guess we should call that a 'feature' and not a bug then :). I would bet a lot of people have unknowingly done the same thing, since it is not exactly obvious. This is unrelated, but the module's copyright belongs to Google, which is not something I've seen a lot in the Perl world.

ADD REPLY • link 10.5 years ago by SES 8.6k

0

ADD REPLY • link 10.5 years ago by Kenosis ★ 1.3k