Re: graphic chars, set-font and sed

Glynn Clements <glynn@xxxxxxxxxxxxxxxxxx> · Thu, 21 Apr 2005 16:34:29 +0100

Luca Ferrari wrote:

> > > Second, I've noticed that sed regular expressions get
> > > confused by the presence of multiple semigraphic chars, while a single
> > > one seems to work ok. Does anybody know a way to "escape" those chars,
> > > in order to make them understandable to sed and other programs?
> >
> > sed itself should be 8-bit clean; are you sure that this isn't an
> > encoding (e.g. ISO-8859-1 vs UTF-8) issue?
> 
> I don't know what you mean by "encoding issue". How can I discover it?

An encoding is a mechanism for representing characters as bytes. 
Examples of commonly-used encodings are ASCII, ISO-8859-1 and UTF-8.

ISO-8859-1 is a single-byte encoding. There are 191 printable
characters and 65 control characters (including DEL), each encoded as
a single byte. E.g. the character "æ" (a-e ligature, code 230) is
represented by the byte 230 ("\xE6" in C notation).

UTF-8 is a multi-byte encoding. As originally specified, it supports
up to 2^31 characters, each of which is encoded using between 1 and 6
bytes; RFC 3629 restricts it to the Unicode range (up to U+10FFFF), so
at most 4 bytes are used in practice. The first 128 characters (the
ASCII subset) are encoded as a single byte; the next 1920 characters
are encoded as two bytes. E.g. the character "æ" (a-e ligature, code
230) is represented by the byte sequence 195,166 ("\xC3\xA6" in C
notation).
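
As a rough illustration of the two encodings described above, the same
character can be encoded both ways and the resulting bytes inspected.
Python is used here purely for convenience (it isn't part of the
original question), and the variable names are arbitrary:

```python
# Encode the same character in both encodings and look at the raw bytes.
ch = "\u00e6"  # the character æ (a-e ligature, code 230)

latin1_bytes = ch.encode("iso-8859-1")  # single-byte encoding
utf8_bytes = ch.encode("utf-8")         # multi-byte encoding

print(list(latin1_bytes))  # [230]
print(list(utf8_bytes))    # [195, 166]
```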

sed itself works with bytes, not characters. This means that it will
work with any single-byte encoding (e.g. ASCII and all of the
ISO-8859-* encodings), but operators which apply to a single character
(e.g. ".", "*" and bracket expressions) won't behave as expected with
multi-byte encodings such as UTF-8.

If you were to use an expression such as 'æ*' (zero or more
occurrences of the æ character), it would work in ISO-8859-1 (i.e. 
zero or more occurrences of byte 230) but not in UTF-8, where it would
be interpreted as byte 195 followed by zero or more occurrences of
byte 166 (the * operator means "zero or more occurrences of the
preceding byte").
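
The effect can be sketched with Python's "re" module, which matches
byte-by-byte when given a bytes pattern; this is only an analogy for a
byte-oriented sed, not a statement about any particular sed version.
The grouped pattern is one possible workaround: making the operator
apply to the whole byte sequence (in sed's basic regular expressions
the equivalent grouping would be '\(æ\)*'):

```python
import re

ae = "\u00e6".encode("utf-8")  # b'\xc3\xa6', i.e. bytes 195,166

# Naive pattern: the * only applies to the last byte (166).
naive = re.compile(ae + b"*")
# Grouped pattern: the * applies to the whole two-byte sequence.
grouped = re.compile(b"(?:" + ae + b")*")

data = ae * 3  # "æææ" encoded as UTF-8 (6 bytes)

naive_match = naive.match(data).group()
grouped_match = grouped.match(data).group()

print(len(naive_match))    # 2: only the first æ is matched
print(len(grouped_match))  # 6: all three are matched
```

Note that the naive pattern will also happily match byte 195 followed
by several bytes 166, which isn't valid UTF-8 text at all.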

The semigraphic characters aren't part of the ASCII set, so the
sequence of bytes used to represent them will vary depending upon the
encoding which is used.
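
For example, assuming that "semigraphic" here means the box-drawing
characters (that is a guess on my part), the light horizontal line
U+2500 is three bytes in UTF-8, one byte in the IBM PC code page 437,
and doesn't exist in ISO-8859-1 at all. A small sketch in Python:

```python
box = "\u2500"  # BOX DRAWINGS LIGHT HORIZONTAL (assumed example character)

utf8 = box.encode("utf-8")    # three bytes
cp437 = box.encode("cp437")   # one byte in the IBM PC code page

# ISO-8859-1 has no box-drawing characters, so encoding fails.
try:
    box.encode("iso-8859-1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False

print(utf8.hex(), cp437.hex(), in_latin1)  # e29480 c4 False
```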

Essentially, you have to bear in mind that sed's regular expressions,
and the stream of data which it processes, are sequences of bytes, not
characters.
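
As for discovering which encoding a file uses: there is no completely
reliable way, but a common heuristic is to check whether the bytes are
valid UTF-8, since text in a single-byte encoding which contains
non-ASCII characters almost never is. A minimal sketch in Python (the
function name is made up for this example):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form valid UTF-8 (heuristic only)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

utf8_sample = "\u00e6".encode("utf-8")         # bytes 195,166
latin1_sample = "\u00e6".encode("iso-8859-1")  # byte 230

print(looks_like_utf8(utf8_sample))    # True
print(looks_like_utf8(latin1_sample))  # False
```

Pure ASCII data is valid in both encodings, so the check only tells
you something when the data contains bytes above 127.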

-- 
Glynn Clements <glynn@xxxxxxxxxxxxxxxxxx>