[BUG] Improper 8-bit parsing because of signed overflow

max ulidtko <ulidtko@xxxxxxxxx> · Sat, 15 Jan 2011 21:24:37 +0200

In parser.c there is a function readtoken1() which fails to properly
parse some 8-bit (i.e. UTF-8) tokens (filenames). Consider the following
test:

$ cat < тест
sh: cannot open ÑÐïÑ: No such file
$ echo "тест" | od -b
0000000 321 202 320 265 321 201 321 202 012
0000011

Here "тест" is four Cyrillic characters, encoded as 8 bytes of UTF-8.
dash fails to parse the third character; to be exact, its second byte,
\201, which is the sixth byte of the string.
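
To double-check the byte arithmetic, here is a minimal sketch in C. It is
not code from dash; the names utf8_char_count and test_bytes are made up
for this illustration. It counts UTF-8 characters by skipping continuation
bytes (those of the form 10xxxxxx) and uses the eight bytes from the od
output above.

```c
#include <stddef.h>

/* Hypothetical helper: count UTF-8 characters by counting every byte
 * that is NOT a continuation byte (continuation bytes look like 10xxxxxx). */
size_t utf8_char_count(const unsigned char *bytes, size_t len)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((bytes[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* The eight bytes printed by od above, in octal, without the trailing
 * newline (012). */
const unsigned char test_bytes[] = {
    0321, 0202, 0320, 0265, 0321, 0201, 0321, 0202
};
```

Calling utf8_char_count(test_bytes, sizeof test_bytes) gives 4, and
test_bytes[5] is the \201 byte (129 decimal) discussed here.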

The reason is signed overflow. The parser uses syntax tables to
determine the class to which a given byte (treated as a whole
character) belongs. The lookup is done like this:
	switch(syntax[c]) {
But the variable c is declared as int, and for this byte it holds a
negative value. So instead of looking up entry 129 (\201 in octal),
the parser uses the signed index -127 and reads garbage which happens
to be nonzero (0 == CWORD). As a result, the output token becomes
corrupted.
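
The effect can be reproduced outside of dash. The following is only a
sketch under my own assumptions: the function names are hypothetical, and
I assume the byte passes through a signed char before being widened to
int (converting a value above 127 to signed char is implementation-defined
in C, but gives -127 for \201 on the usual two's-complement platforms).

```c
/* Hypothetical: the index that results when the byte is stored in a
 * signed char and then widened to int. The widening sign-extends, so
 * bytes above 127 turn into negative indices. */
int index_via_signed_char(unsigned char byte)
{
    signed char sc = (signed char)byte; /* \201 becomes -127 here */
    int c = sc;                         /* sign extension keeps it -127 */
    return c;
}

/* Hypothetical: the index that results when the byte stays unsigned
 * before being widened to int. The value is preserved. */
int index_via_unsigned_char(unsigned char byte)
{
    int c = byte;                       /* \201 stays 129 */
    return c;
}
```

For the byte 0201, index_via_signed_char gives -127 while
index_via_unsigned_char gives 129; for 7-bit bytes both agree, which is
why plain ASCII input is not affected.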

Here is some gdb output:
(gdb) next
884				switch(syntax[c]) {
8: syntax[c] = 12 '\f'
7: out = 0x8061659 ""
6: stacknxt = 0x8061654 "те", <incomplete sequence \321>
5: (char)c = -127 '\201'
(gdb) print syntax[129]
$42 = 0 '\000'
(gdb) print syntax[(unsigned char)c]
$43 = 0 '\000'
(gdb) print syntax[c]
$44 = 12 '\f'

Note that *all* 8-bit characters are looked up in the syntax tables
incorrectly. Although only some cases lead to user-visible breakage,
this is definitely a bug that needs to be fixed.

However, because of the low-level style of the code, I was unable to
work out a working fix. My first attempt, changing the lookup to
syntax[(unsigned char)c], didn't work.

------
Regards,
max ulidtko
