Wubi Wiki: PerlUnicode

TACKER: nerdi (Nerdi Veron)
SUBJECT: .. Unicode character boundaries
DATE: 24-May-05 12:19:01
HOST: sdf

You need to specify that your input and output streams are utf-8 (or another
relevant encoding); otherwise they are likely to be handled as streams of
eight-bit native bytes. There are many ways to convince Perl to use utf-8.
For example, run "perl -C", or set the PERL_UNICODE=S environment variable
(the value is a -C option list, not a boolean; "1" would cover STDIN only),
or just open the streams with an explicit encoding.
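
As a minimal sketch of the last option (opening streams with an explicit
encoding), the three-argument open accepts an I/O layer next to the mode.
The helper name bracket_file and the file paths are hypothetical, chosen
only for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: copy a UTF-8 text file, enclosing every
# non-space character in square brackets.
sub bracket_file {
    my ($in_path, $out_path) = @_;

    # The :encoding(UTF-8) layer decodes bytes into characters on input
    # and encodes characters back into bytes on output.
    open(my $in,  '<:encoding(UTF-8)', $in_path)
        or die "Cannot read $in_path: $!";
    open(my $out, '>:encoding(UTF-8)', $out_path)
        or die "Cannot write $out_path: $!";

    while (my $line = <$in>) {
        $line =~ s/(\S)/[$1]/g;   # one match per character, not per byte
        print {$out} $line;
    }

    close($in);
    close($out) or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl bracket.pl input.txt output.txt
bracket_file(@ARGV) if @ARGV == 2;
```

Putting the layer on the handle keeps the rest of the script unaware of the
encoding; only the open calls need to change if the files use another one.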

For example, this causes every character (including non-ASCII ones) to be
handled as a single utf8 character rather than as separate bytes:

# if you only specify utf8 for input, then output is converted to latin1
binmode(STDIN, ":utf8");
binmode(STDOUT, ":encoding(utf8)");  # another more general syntax

my $line = <STDIN>;
$line =~ s/(\S)/[$1]/g;  # enclose each non-space unicode char in square brackets
$line =~ s/(\w)/\U$1/g;  # uppercase all unicode word chars (letters) [*]
print $line;
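
To see what the layers change, here is a small sketch that applies the same
substitution to a raw byte string and to its decoded form. It assumes the
core Encode module; the sample word and the helper name bracket are made up
for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes of a three-letter word with two accented letters
# (e-acute, t, e-acute), as they would arrive without any I/O layer.
my $bytes = "\xc3\xa9t\xc3\xa9";

# Decoding is what binmode(STDIN, ":utf8") does implicitly on input.
my $chars = decode('UTF-8', $bytes);

# Hypothetical helper: the same substitution as in the example above.
sub bracket {
    my ($s) = @_;
    $s =~ s/(\S)/[$1]/g;
    return $s;
}

printf "length as bytes: %d\n", length $bytes;   # 5
printf "length as chars: %d\n", length $chars;   # 3

# On raw bytes every byte gets its own brackets, which splits each
# accented letter in two; on decoded text each letter stays whole.
my $broken = bracket($bytes);
my $whole  = encode('UTF-8', bracket($chars));

printf "brackets on bytes: %d\n", scalar(() = $broken =~ /\[/g);   # 5
printf "brackets on chars: %d\n", scalar(() = $whole  =~ /\[/g);   # 3
```

Only the counts are printed, so the output looks the same regardless of how
the terminal is configured.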

Or you may run this one-liner on freeshell to see that this works:

perl -pe 'BEGIN { binmode($_, ":utf8") foreach STDIN, STDOUT } s/(.)/[$1]/g' \
  </sys/pkg/share/examples/libutf/langcoll.utf  > char_boundaries.utf8

Unfortunately, perl on freeshell is 5.8.0. If it were at least 5.8.1, you
could simply run "perl -CS" or "env PERL_UNICODE=S perl" to work with
unicode chars. But the binmode/open technique should work in 5.8.0 and later.

[*] Beware that modern BSD systems, unlike GNU systems (with GNU libc), may
still lack native support for unicode, i.e. the national locales may be
missing.

Reading:

man perluniintro
man perlunicode
man perllocale