TACKER: nerdi (Nerdi Veron)
SUBJECT: .. Unicode character boundaries
DATE: 24-May-05 12:19:01
HOST: sdf

You need to specify that your input and output streams are UTF-8 (or another
relevant encoding); otherwise they are likely to be handled as streams of
eight-bit native bytes. There are many ways to convince Perl to use UTF-8.
For example, run "perl -C", or set the PERL_UNICODE environment variable
(e.g. PERL_UNICODE=S; note that a value of 1 covers STDIN only), or just
open the streams with an explicit encoding.

For example, this causes any character (like .) to be handled as utf8:

    # if you only specify utf8 for input, then output is converted to latin1
    binmode(STDIN, ":utf8");
    binmode(STDOUT, ":encoding(utf8)");  # another more general syntax

    my $line = <STDIN>;
    $line =~ s/(\S)/[$1]/g;   # enclose any non-space unicode char in square brackets
    $line =~ s/(\w)/\U$1/g;   # convert all unicode letters to uppercase [*]
    print $line;

Or you may run this one-liner on freeshell to see that this works:

    perl -pe 'BEGIN { binmode($_, ":utf8") foreach STDIN, STDOUT } s/(.)/[$1]/g' \
        </sys/pkg/share/examples/libutf/langcoll.utf > char_boundaries.utf8

Unfortunately, perl on freeshell is 5.8.0. If it were at least 5.8.1, you
could simply run "perl -CS" or "env PERL_UNICODE=S perl" to work with
Unicode chars. But the binmode/open technique should work in 5.8.0 and later.
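
For the open variant, here is a minimal sketch. It assumes the data is
UTF-8 and uses a temporary file as stand-in input; bracket_chars() is
just a made-up helper name for illustration, not something from a module.
It writes "café" as raw bytes (the é is two bytes in UTF-8) and reads it
back through an :encoding(UTF-8) layer, so the regex sees four characters
rather than five bytes:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-8 file and wrap every non-space
# character in square brackets.
sub bracket_chars {
    my ($path) = @_;
    # The :encoding(UTF-8) layer decodes bytes into characters on read.
    open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
    my $text = do { local $/; <$in> };   # slurp the whole file
    close($in);
    $text =~ s/(\S)/[$1]/g;
    return $text;
}

# Stand-in input: "caf" plus U+00E9 written as its two UTF-8 bytes.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode($fh);                            # raw bytes, no layer
print {$fh} "caf\xC3\xA9\n";
close($fh);

binmode(STDOUT, ':encoding(UTF-8)');     # encode characters on output
print bracket_chars($path);
```

If you drop the encoding layer from the open call, each of the two bytes
of é is treated as a separate character and gets its own brackets.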

[*] Beware that modern BSD systems, unlike GNU systems (with GNU libc), may
still lack native support for Unicode, i.e. the national locales are missing.


man perluniintro
man perlunicode
man perllocale