Luca Ferrari wrote:

> > > Second, I've noticed that sed regular expressions get
> > > confused by the presence of multiple semigraphic chars, while a single
> > > one seems to work ok. Does anybody knows a way to "escape" those chars,
> > > in order to make them understandable to sed and other programs?
> >
> > sed itself should be 8-bit clean; are you sure that this isn't an
> > encoding (e.g. ISO-8859-1 vs UTF-8) issue?
>
> I don't know what you mean with "encoding issue". How can I discover it?

An encoding is a mechanism for representing characters as bytes.
Examples of commonly-used encodings are ASCII, ISO-8859-1 and UTF-8.

ISO-8859-1 is a single-byte encoding. There are 191 printable
characters and 65 control characters, each encoded as a single byte.
E.g. the character "æ" (a-e ligature, code 230) is represented by the
byte 230 ("\xE6" in C notation).

UTF-8 is a multi-byte encoding. It supports up to 2^31 characters,
each of which is encoded using between 1 and 6 bytes. The first 128
characters (the ASCII subset) are encoded as a single byte; the next
1920 characters are encoded as two bytes. E.g. the character "æ" (a-e
ligature, code 230) is represented by the byte sequence 195,166
("\xC3\xA6" in C notation).

sed itself works with bytes, not characters. This means that it will
work with any single-byte encoding (e.g. ASCII and all of the
ISO-8859-* encodings), but it won't work with multi-byte encodings
such as UTF-8.

If you were to use an expression such as 'æ*' (zero or more
occurrences of the æ character), it would work in ISO-8859-1 (i.e.
zero or more occurrences of byte 230) but not in UTF-8, where it would
be interpreted as byte 195 followed by zero or more occurrences of
byte 166 (the * operator means "zero or more occurrences of the
preceding byte").

The semigraphic characters aren't part of the ASCII set, so the
sequence of bytes used to represent them will vary depending upon the
encoding which is used.

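To make the 'æ*' example concrete, here is a rough sketch in Python
rather than sed. It uses Python's "re" module on byte strings as a
stand-in for a byte-oriented regex engine; that substitution is an
assumption made for illustration, not a description of how any
particular sed build behaves. The helper names match_len and
looks_like_utf8 are made up for this example.

```python
import re

ch = "æ"  # a-e ligature, code 230

# The same character as bytes in the two encodings discussed above.
latin1 = ch.encode("iso-8859-1")  # b'\xe6'      (byte 230)
utf8 = ch.encode("utf-8")         # b'\xc3\xa6'  (bytes 195, 166)

# Byte-oriented versions of the pattern 'æ*'. With bytes patterns,
# '*' applies to the single preceding byte only.
pat_latin1 = re.compile(latin1 + b"*")  # zero or more of byte 230
pat_utf8 = re.compile(utf8 + b"*")      # byte 195, then zero or more of byte 166


def match_len(pattern, data):
    """Number of bytes matched at the start of data, or None if no match."""
    m = pattern.match(data)
    return None if m is None else len(m.group(0))


def looks_like_utf8(data):
    """Crude check (hypothetical helper): does data decode as UTF-8?"""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


three_latin1 = "æææ".encode("iso-8859-1")  # 3 bytes
three_utf8 = "æææ".encode("utf-8")         # 6 bytes

print(match_len(pat_latin1, three_latin1))  # 3: all three characters
print(match_len(pat_utf8, three_utf8))      # 2: only the first character
print(match_len(pat_utf8, b"abc"))          # None: byte 195 is mandatory
print(looks_like_utf8(latin1), looks_like_utf8(utf8))  # False True
```

In the ISO-8859-1 case the pattern consumes all three characters. In
the UTF-8 case it stops after the first one, and it cannot match zero
occurrences at all, because byte 195 has become a mandatory prefix.
The decode check is only a quick way to see which encoding a file is
likely to be in: bytes that fail to decode as UTF-8 are certainly not
UTF-8, but a successful decode is not proof.
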
Essentially, you have to bear in mind that sed's regular expressions,
and the stream of data which it processes, are sequences of bytes, not
characters.

-- 
Glynn Clements <glynn@xxxxxxxxxxxxxxxxxx>