MIME::Latin1
NAME
SYNOPSIS
    use MIME::Latin1 qw(latin1_to_ascii);

    $dirty = "Fran\347ois";
    print latin1_to_ascii($dirty);      # prints out "Fran\c,ois"
DESCRIPTION
This package is used by the "7bit" encoder/decoder to handle the case where a
user wants to 7bit-encode a document that contains 8-bit (presumably Latin-1)
characters. It provides a mapping whereby every 8-bit character is mapped to a
unique sequence of two 7-bit characters that approximates the appearance or
pronunciation of the Latin-1 character. For example:
    This...                      maps to...
    --------------------------------------------------
    A c with a cedilla           c,
    A C with a cedilla           C,
    An "AE" ligature             AE
    An "ae" ligature             ae
    Yen sign                     Y-
I call each of these 7-bit 2-character encodings mnemonic encodings, since they (hopefully) are visually reminiscent of the 8-bit characters they are meant to represent.
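The mapping described above can be pictured as a lookup table. The following is only a minimal sketch, not the module's own code: the table name %MNEMONIC and the function sketch_mnemonic are made up for illustration, and the table holds just the five rows listed above (the byte values are the standard Latin-1 code points for those characters). It emits the bare two-character mnemonics, without the leading "\" that latin1_to_ascii adds by default.

```perl
use strict;
use warnings;

# Hypothetical table covering only the five rows shown above; the real
# module maps every 8-bit character, and its table is not reproduced here.
my %MNEMONIC = (
    "\xE7" => 'c,',    # c with a cedilla
    "\xC7" => 'C,',    # C with a cedilla
    "\xC6" => 'AE',    # "AE" ligature
    "\xE6" => 'ae',    # "ae" ligature
    "\xA5" => 'Y-',    # Yen sign
);

# Replace each 8-bit character that has a table entry with its two 7-bit
# characters; characters missing from this toy table pass through unchanged.
sub sketch_mnemonic {
    my ($str) = @_;
    $str =~ s/([\x80-\xFF])/exists $MNEMONIC{$1} ? $MNEMONIC{$1} : $1/ge;
    return $str;
}

print sketch_mnemonic("Fran\347ois"), "\n";    # prints "Franc,ois"
```

Because every mnemonic is exactly two characters long, a decoder never has to guess where one sequence ends and the next begins.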
PUBLIC INTERFACE
latin1_to_ascii STRING, [OPTS]
Map the Latin-1 characters in STRING to sequences of the form:
    \xy
Where xy is a two-character sequence that visually approximates the Latin-1
character. For example:
    c cedilla      => \c,
    n tilde        => \n~
    AE ligature    => \AE
    small o slash  => \o/
The sequences are taken almost exactly from the Sun character composition sequences for generating these characters. The translation may be further tweaked by the (optional) OPTS string:
With no OPTS string, only 8-bit characters are affected, and each is output in the form \xy:
    \<<Fran\c,ois M\u"ller\>> c:\usr\games
A second option omits the leading "\", making the output more compact:
    <<Franc,ois Mu"ller>> c:\usr\games
With the ``ENCODE'' option, not only is the leading "\" output, but any other occurrences of "\" are escaped as well by turning them into "\\". Unlike the other options, this produces output which may easily be parsed and turned back into the original 8-bit characters, so in a way it is its own full-fledged encoding... and given that "\" is a rare-enough character, not much uglier than the normal output:
    \<<Fran\c,ois M\u"ller\>> c:\\usr\\games
You may use ascii_to_latin1 to decode this.
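The three output styles can be compared side by side with a small sketch. This is an illustration under stated assumptions, not the module's implementation: sketch_encode and %MNEMONIC are hypothetical names, the table holds only the four characters used in the example, and 'COMPACT' is a label invented here for the style that omits the leading "\" (it is not the module's actual option name).

```perl
use strict;
use warnings;

# Hypothetical subset of the table: only the characters in the example.
my %MNEMONIC = (
    "\xAB" => '<<',    # left guillemet
    "\xBB" => '>>',    # right guillemet
    "\xE7" => 'c,',    # c cedilla
    "\xFC" => 'u"',    # u umlaut
);

# $mode is '' (default), 'ENCODE', or 'COMPACT'. 'COMPACT' is a label made
# up for this sketch; it is not the module's actual option name.
sub sketch_encode {
    my ($str, $mode) = @_;
    $mode = '' unless defined $mode;
    my $lead = $mode eq 'COMPACT' ? '' : '\\';
    my $out  = '';
    for my $ch (split //, $str) {
        if (exists $MNEMONIC{$ch}) {
            $out .= $lead . $MNEMONIC{$ch};    # mapped 8-bit character
        }
        elsif ($ch eq '\\' && $mode eq 'ENCODE') {
            $out .= '\\\\';                    # double a literal backslash
        }
        else {
            $out .= $ch;                       # everything else passes through
        }
    }
    return $out;
}

my $text = "\xABFran\xE7ois M\xFCller\xBB c:\\usr\\games";
print sketch_encode($text), "\n";               # default style
print sketch_encode($text, 'COMPACT'), "\n";    # no leading backslash
print sketch_encode($text, 'ENCODE'), "\n";     # backslashes doubled
```

Only the last style is reversible: in the first two, a "\" that was already present in the input cannot be told apart from one introduced by the mapping.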
ascii_to_latin1 STRING
Map the Latin-1 escapes in STRING (sequences of the form \xy) back into actual 8-bit characters.
    # Assume $enc holds the actual text... \<<Fran\c,ois \\ M\u"ller\>>
    print ascii_to_latin1($enc);
Unrecognized sequences are turned into '?' characters.
Note: you must have specified the ``ENCODE'' option when encoding in order to decode!
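To make the decoding rules concrete, here is a rough sketch of the reverse direction. It assumes input in the ``ENCODE'' style, where every literal "\" has been doubled; the names sketch_decode, _decode_pair and %DECODE are hypothetical, and the table covers only the characters in the example above, so it is not a substitute for ascii_to_latin1.

```perl
use strict;
use warnings;

# Hypothetical reverse table for the characters in the example only.
my %DECODE = (
    '<<' => "\xAB",    # left guillemet
    '>>' => "\xBB",    # right guillemet
    'c,' => "\xE7",    # c cedilla
    'u"' => "\xFC",    # u umlaut
);

# Decode what follows one backslash: another backslash stands for a literal
# backslash, a known pair yields its 8-bit character, anything else gives '?'.
sub _decode_pair {
    my ($pair) = @_;
    return '\\' if $pair eq '\\';
    return exists $DECODE{$pair} ? $DECODE{$pair} : '?';
}

# Scan left to right; each backslash introduces either "\\" or "\xy".
sub sketch_decode {
    my ($enc) = @_;
    $enc =~ s/\\(\\|..)/_decode_pair($1)/ges;
    return $enc;
}

my $enc = '\\<<Fran\\c,ois \\\\ M\\u"ller\\>>';    # \<<Fran\c,ois \\ M\u"ller\>>
my $dec = sketch_decode($enc);

# Show 8-bit characters as \xNN so the result is readable on any terminal.
(my $shown = $dec) =~ s/([\x80-\xFF])/sprintf('\\x%02X', ord $1)/ge;
print "$shown\n";
```

The left-to-right scan is unambiguous only because the encoder doubled every literal backslash, which is why output produced without ``ENCODE'' cannot be decoded reliably.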
NOTES
Characters in the range 0x80-0x9F have no mnemonic encoding, so they are represented by two-digit hex sequences:
    80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
    90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
To allow this scheme to work properly for all 8-bit-on characters, the general rule is: the first hex digit is DOWNcased, and the second hex digit is UPcased. Hence, these are all decodable sequences:
    a0 a1 a2 a3 a4 a5 a6 a7 a8 a9 aA aB aC aD aE aF
This ``downcase-upcase'' style is so we don't conflict with mnemonically-encoded ligatures like ``ae'' and ``AE'', the latter of which could reasonably have been represented as ``Ae''.
Note that we must never have a mnemonic encoding that could be mistaken for a hex sequence from ``80'' to ``fF'', since the ambiguity would make it impossible to decode. (However, ``12'', ``34'', ``Ff'', etc. are perfectly fine.)
Thanks to Rolf Nelson for reporting the ``gap'' in the encoding.
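The downcase-upcase rule is easy to check mechanically. The sketch below is illustrative only: sketch_hex_pair and sketch_is_hex_pair are hypothetical helpers, not functions exported by the module. The first formats a byte according to the rule, and the second tests whether a two-character sequence falls in the reserved range ``80'' to ``fF''.

```perl
use strict;
use warnings;

# Format a byte as two hex digits: first digit lowercase, second uppercase.
sub sketch_hex_pair {
    my ($byte) = @_;
    my $hex = sprintf '%02x', ord $byte;
    return lc(substr($hex, 0, 1)) . uc(substr($hex, 1, 1));
}

# True if a two-character sequence lies in the reserved hex range "80".."fF",
# which means it must never be used as a mnemonic encoding.
sub sketch_is_hex_pair {
    my ($pair) = @_;
    return $pair =~ /\A[89a-f][0-9A-F]\z/ ? 1 : 0;
}

# The first row of hex sequences: 80 81 ... 8E 8F
print join(' ', map { sketch_hex_pair(chr $_) } 0x80 .. 0x8F), "\n";

# Classify a few candidates: only "aE" and "fF" are reserved hex sequences.
for my $pair (qw(aE ae AE Ae 12 Ff fF)) {
    printf "%s is %s\n", $pair,
        sketch_is_hex_pair($pair) ? 'a hex sequence' : 'free for mnemonics';
}
```

Under this rule the ligature mnemonics ``ae'' and ``AE'' stay distinct from the hex sequence ``aE'', and ``Ae'' would have been safe as well.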
ascii_to_latin1() to perform the reverse mapping. I will
strive for backwards-compatibility in that code.
AUTHOR
All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
VERSION