In parser.c, the function readtoken1() fails to properly parse some tokens (filenames) that contain 8-bit bytes, e.g. UTF-8 text. Consider the following test:

    $ cat < тест
    sh: cannot open ÑÐïÑ: No such file
    $ echo "тест" | od -b
    0000000 321 202 320 265 321 201 321 202 012
    0000011

Here "тест" is four Cyrillic characters, which are encoded as 8 bytes of UTF-8. The third character (its second byte, which is the sixth byte overall, \201, to be exact) fails to be parsed by dash.

The reason appears to be a signedness problem. The parser uses syntax tables to determine the class to which a given byte belongs (it treats each byte as a whole character). The lookup is done like this:

    switch(syntax[c]) {

But the variable c is declared as int, and for this byte it holds a negative value. So instead of looking up the character \201 (129 in decimal), the parser uses the signed index -127 and looks up garbage, which happens to be not equal to 0 (== CWORD). As a result, the output token becomes corrupted. Here is some gdb output:

    (gdb) next
    884             switch(syntax[c]) {
    8: syntax[c] = 12 '\f'
    7: out = 0x8061659 ""
    6: stacknxt = 0x8061654 "те", <incomplete sequence \321>
    5: (char)c = -127 '\201'
    (gdb) print syntax[129]
    $42 = 0 '\000'
    (gdb) print syntax[(unsigned char)c]
    $43 = 0 '\000'
    (gdb) print syntax[c]
    $44 = 12 '\f'

I would note that *any* 8-bit character is looked up in the syntax tables this way, i.e. with a negative index. Though only some cases lead to user-visible breakage, this is definitely a bug which needs to be fixed.

However, because the code is written in such a low-level style, I was unable to come up with a working fix. My first attempt, changing the lookup to syntax[(unsigned char)c], didn't work.

------
Regards,
max ulidtko
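
P.S. To make the index difference concrete, here is a minimal standalone sketch. It is not dash code: the table, the guard area in front of it, and every demo_* name are hypothetical and exist only for illustration. The only values taken from the report are CWORD == 0, the class 12 seen in gdb, and the byte \201. It shows how the same byte selects a different table entry depending on whether it is used as a signed or an unsigned index; it does not claim to be a fix for readtoken1().

```c
#include <assert.h>

/* Hypothetical character classes. Only the numeric values come from the
 * report (CWORD == 0, and 12 as seen in the gdb session). */
enum { DEMO_CWORD = 0, DEMO_OTHER = 12 };

/* Hypothetical table: 256 real entries preceded by 128 guard entries, so
 * that a negative index stays inside the array in this demo. */
static unsigned char demo_storage[128 + 256];
static unsigned char *const demo_syntax = demo_storage + 128;

void demo_init(void)
{
    int i;
    for (i = -128; i < 0; i++)
        demo_syntax[i] = DEMO_OTHER;   /* stands in for the "garbage" */
    for (i = 0; i < 256; i++)
        demo_syntax[i] = DEMO_CWORD;   /* every real byte is a word char */
}

/* Mimics a byte arriving in an int through a signed char:
 * bytes 128..255 become -128..-1. */
int demo_read_byte(const char *p)
{
    int b = (unsigned char)*p;
    return b >= 128 ? b - 256 : b;
}

/* Lookup as in the report: a negative c indexes in front of entry 0. */
int demo_lookup_signed(int c)
{
    return demo_syntax[c];
}

/* Lookup with the value normalised to 0..255 first. */
int demo_lookup_unsigned(int c)
{
    return demo_syntax[(unsigned char)c];
}

/* Reproduces the three gdb prints for the byte \201. */
void demo_selfcheck(void)
{
    int c;
    demo_init();
    c = demo_read_byte("\201");
    assert(c == -127);
    assert(demo_lookup_signed(c) == DEMO_OTHER);      /* syntax[c] */
    assert(demo_lookup_unsigned(c) == DEMO_CWORD);    /* syntax[(unsigned char)c] */
    assert(demo_lookup_signed(129) == DEMO_CWORD);    /* syntax[129] */
}
```

Calling demo_selfcheck() from a main() should pass silently. The cast to unsigned char maps -127 back to 129, which is why the two lookups disagree here in the same way as in the gdb session.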