Re: Issue in man page charsets.7

Alejandro Colomar <alx.manpages@xxxxxxxxx> · Sun, 5 Feb 2023 15:28:45 +0100

Hi Helge,

On 1/22/23 20:31, Helge Kreutzmann wrote:
Without further ado, the following was found:

Issue:    ISO → ISO/IEC

Could someone please write a documented patch for this one?

Cheers,

Alex

"ASCII (American Standard Code For Information Interchange) is the original 7-"
"bit character set, originally designed for American English.  Also known as"
"US-ASCII.  It is currently described by the ISO 646:1991 IRV (International"
"Reference Version) standard."

"The ISO 2022 and 4873 standards describe a font-control model based on VT100"
"practice.  This model is (partially) supported by the Linux kernel and by"
"B<xterm>(1).  Several ISO 2022-based character encodings have been defined,"
"especially for Japanese."

"A 94-character set is designated as GI<n> character set by an escape"
"sequence ESC ( xx (for G0), ESC ) xx (for G1), ESC * xx (for G2), ESC + xx"
"(for G3), where xx is a symbol or a pair of symbols found in the ISO 2375"
"International Register of Coded Character Sets.  For example, ESC ( @"
"selects the ISO 646 character set as G0, ESC ( A selects the UK standard"
"character set (with pound instead of number sign), ESC ( B selects ASCII"
"(with dollar instead of currency sign), ESC ( M selects a character set for"
"African languages, ESC ( ! A selects the Cuban character set, and so on."
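
For readers following along, here is a minimal Python sketch of how the designation sequences in the paragraph above are put together. The helper name designate_94 and the lookup table are hypothetical, made up for illustration, and not part of any library. The spaces in "ESC ( B" are only for readability; the bytes themselves are contiguous.

```python
ESC = "\x1b"

# Intermediate character that picks the target element for a
# 94-character set, as listed in the quoted paragraph.
# (Hypothetical table, for illustration only.)
DESIGNATE_94 = {"G0": "(", "G1": ")", "G2": "*", "G3": "+"}


def designate_94(target: str, final: str) -> str:
    """Build the escape sequence designating a 94-character set.

    `final` is the "xx" part: one symbol, or a pair such as "!A".
    """
    return ESC + DESIGNATE_94[target] + final


if __name__ == "__main__":
    print(repr(designate_94("G0", "B")))   # ASCII as G0
    print(repr(designate_94("G0", "A")))   # UK set as G0
    print(repr(designate_94("G0", "!A")))  # Cuban set as G0
```

Whether a given terminal honours a particular sequence depends on the terminal; the sketch only builds the bytes.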

"ISO 4873 stipulates a narrower use of character sets, where G0 is fixed"
"(always ASCII), so that G1, G2, and G3 can be invoked only for codes with"
"the high order bit set.  In particular, B<\\(haN> and B<\\(haO> are not used"
"anymore, ESC ( xx can be used only with xx=B, and ESC ) xx, ESC * xx, ESC +"
"xx are equivalent to ESC - xx, ESC . xx, ESC / xx, respectively."

"Unicode (ISO 10646) is a standard which aims to unambiguously represent"
"every character in every human language.  Unicode's structure permits 20.1"
"bits to encode every character.  Since most computers don't include 20.1-bit"
"integers, Unicode is usually encoded as 32-bit integers internally and"
"either a series of 16-bit integers (UTF-16) (needing two 16-bit integers"
"only when encoding certain rare characters) or a series of 8-bit bytes"
"(UTF-8)."
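
A small Python sketch of the arithmetic behind this paragraph. I am assuming the "20.1 bits" figure is the base-2 logarithm of the 0x110000 code points in the range U+0000..U+10FFFF. The function utf16_units is a hypothetical helper showing when a second 16-bit integer is needed; it assumes its argument is a valid code point and does not reject surrogates.

```python
import math

# Number of code points in U+0000..U+10FFFF (17 planes of 65536).
CODE_POINTS = 0x110000

# Assumption: this is where the "20.1 bits" figure comes from.
bits_needed = math.log2(CODE_POINTS)


def utf16_units(cp: int) -> list:
    """Return the 16-bit integers that encode code point `cp` in UTF-16."""
    if cp < 0x10000:
        return [cp]                    # one unit is enough
    v = cp - 0x10000                   # 20 bits remain
    high = 0xD800 | (v >> 10)          # upper 10 bits
    low = 0xDC00 | (v & 0x3FF)         # lower 10 bits
    return [high, low]


if __name__ == "__main__":
    print(round(bits_needed, 1))
    print([hex(u) for u in utf16_units(0x20AC)])
    print([hex(u) for u in utf16_units(0x1F600)])
```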

"A byte 110xxxxx is the start of a 2-byte code, and 110xxxxx 10yyyyyy is"
"assembled into 00000xxx xxyyyyyy.  A byte 1110xxxx is the start of a 3-byte"
"code, and 1110xxxx 10yyyyyy 10zzzzzz is assembled into xxxxyyyy yyzzzzzz."
"(When UTF-8 is used to code the 31-bit ISO 10646 then this progression"
"continues up to 6-byte codes.)"
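
The bit patterns above translate fairly directly into code. This is a sketch only: decode_utf8_char is a hypothetical name, it covers just the 1-, 2- and 3-byte forms described in the quoted text, and it does not check that continuation bytes really have the form 10xxxxxx or that the encoding is the shortest one.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode the first character of `data` (1- to 3-byte forms only)."""
    b0 = data[0]
    if b0 < 0x80:
        # 0xxxxxxx: plain ASCII
        return b0
    if (b0 & 0xE0) == 0xC0:
        # 110xxxxx 10yyyyyy -> 00000xxx xxyyyyyy
        return ((b0 & 0x1F) << 6) | (data[1] & 0x3F)
    if (b0 & 0xF0) == 0xE0:
        # 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyy yyzzzzzz
        return ((b0 & 0x0F) << 12) | ((data[1] & 0x3F) << 6) | (data[2] & 0x3F)
    raise ValueError("lead byte not handled by this sketch")


if __name__ == "__main__":
    print(hex(decode_utf8_char(b"\xc3\xa9")))      # e with acute accent
    print(hex(decode_utf8_char(b"\xe2\x82\xac")))  # euro sign
```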

"For most texts in ISO 8859 character sets, this means that the characters"
"outside of ASCII are now coded with two bytes.  This tends to expand"
"ordinary text files by only one or two percent.  For Russian or Greek texts,"
"this expands ordinary text files by 100%, since text in those languages is"
"mostly outside of ASCII.  For Japanese users this means that the 16-bit"
"codes now in common use will take three bytes.  While there are algorithmic"
"conversions from some character sets (especially ISO 8859-1) to Unicode,"
"general conversion requires carrying around conversion tables, which can be"
"quite large for 16-bit codes."
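
The size claims in this paragraph are easy to check with Python's built-in codecs. The sketch below uses a hypothetical helper, utf8_growth, and three very short sample strings of my own choosing, so the ratios for the Latin-1 sample will be far higher than the one or two percent quoted for ordinary text. The last lines illustrate the algorithmic ISO 8859-1 conversion: each byte value is the Unicode code point.

```python
def utf8_growth(text: str, legacy_codec: str) -> float:
    """Ratio of the UTF-8 size of `text` to its size in `legacy_codec`."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_codec))


# Sample strings (chosen for illustration, written with escapes so the
# snippet itself stays ASCII).
german = "Gr\u00f6\u00dfe"                           # Latin-1 text
russian = "\u041f\u0440\u0438\u0432\u0435\u0442"     # Cyrillic text
japanese = "\u65e5\u672c\u8a9e"                      # kanji

if __name__ == "__main__":
    print(utf8_growth(german, "latin-1"))
    print(utf8_growth(russian, "iso8859-5"))
    print(utf8_growth(japanese, "euc_jp"))

    # ISO 8859-1 to Unicode needs no table: byte value == code point.
    raw = bytes([0x47, 0x72, 0xF6, 0xDF, 0x65])
    print("".join(chr(b) for b in raw) == raw.decode("latin-1"))
```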

--
<http://www.alejandro-colomar.es/>
GPG key fingerprint: A9348594CE31283A826FBDD8D57633D441E25BB5