Hi Max,

max ulidtko wrote:

> $ cat < тест
> sh: cannot open ÑÐïÑ: No such file

With Debian dash 0.5.5.1-7.4:

  $ dash -c 'cat < тест' 2>&1 | LC_ALL=C sed -e 's/dash: cannot open \(.*\):.*/\1/' | xxd
  0000000: d182 d0b5 d1d1 820a                      ........
  $ dash -c 'echo тест' | xxd
  0000000: d182 d0b5 d181 d182 0a                   .........

The \x81 is being swallowed up. This is
<http://bugs.debian.org/532302>, fixed by f8231a ([EXPAND] Fix
corruption of redirections with byte 0x81, 2010-05-27).

But your question is still interesting from the point of view of
investigation, so let's move on to that.

> The reason is signed overflow. The parser uses syntax tables to
> determine the class to which a given byte (assuming it's a whole
> character) belongs. The lookup is done like this:
>
>     switch(syntax[c]) {

Given confusing code, it is often helpful to learn what the authors
were thinking when it was written:

  $ git log -S'switch(syntax[c])' -- src/parser.c
  commit 05c1076ba2d1a68fe7f3a5ae618f786b8898d327
  Author: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
  Date:   Mon Sep 26 18:32:28 2005 +1000

      Initial import.

Well, so much for that. Except, does that mean the signed lookup
has been present for five years?

So looking at 05c107:src/parser.c, one is led to wonder how c gets
set in the first place.

  c = pgetc();

What did pgetc do?

  int
  pgetc(void)
  {
          return pgetc_macro();
  }

And pgetc_macro?

  extern char *parsenextc;        /* next character in input buffer */
  [...]
  #define pgetc_macro() (--parsenleft >= 0? *parsenextc++ : preadbuffer())

Sounds unportable --- the signedness depends on the platform.

Okay, so what does syntax[-1] give? 05c107:src/mksyntax.c has some
hints:

  if (sign)
          base += 1 << (nbits - 1);

So syntax starts in the _middle_ of the builtin table when char is
signed.

That code isn't present in current src/mksyntax.c. What gives, one
might wonder?
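As an aside, the middle-of-the-table indexing can be sketched in a few
lines of C. This is only an illustration under stated assumptions, not
dash's actual code: the names table, SYNBASE, as_signed_char, classify
and mark_control are made up here, the two character classes are
placeholders, and marking byte 0x81 as a control character is merely an
example chosen because that is the byte discussed above.

```c
#include <assert.h>

/* Hypothetical character classes; dash's real tables are larger. */
enum { CWORD = 0, CCTL = 1 };

#define NBITS   8
#define SYNBASE (1 << (NBITS - 1))  /* 128, cf. "base += 1 << (nbits - 1)" */

/* One slot for every value of a signed 8-bit char, -128..127. */
static unsigned char table[1 << NBITS];

/* Reinterpret a raw byte as a signed 8-bit value without relying on
 * implementation-defined conversions. */
static int as_signed_char(unsigned char byte)
{
    return byte >= 128 ? (int)byte - 256 : (int)byte;
}

/* Look up the class of a signed character through a pointer that
 * points into the middle of the table. */
static int classify(int c)
{
    const unsigned char *syntax = table + SYNBASE;

    assert(c >= -SYNBASE && c < SYNBASE);
    return syntax[c];               /* syntax[-1] is table[127] */
}

/* Mark one signed value as a control character (illustrative only). */
static void mark_control(int c)
{
    assert(c >= -SYNBASE && c < SYNBASE);
    table[c + SYNBASE] = CCTL;
}
```

With this layout a negative c is a perfectly valid index: after
mark_control(as_signed_char(0x81)), classify(-127) reports CCTL, while
an ordinary byte such as 0xd1 (that is, -47) still comes back as CWORD.
The scheme only works if every caller agrees on whether c is signed,
which is exactly what the native char type does not guarantee.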

  $ git log -1 -S'base +=' -- src/mksyntax.c
  commit d8014392bc291504997c65b3b44a7f21a60b0e07
  Author: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
  Date:   Sun Apr 23 16:01:05 2006 +1000

      [PARSER] Only use signed char for syntax arrays

      The existing scheme of using the native char for syntax array
      indicies makes cross-compiling difficult.  Therefore it makes sense
      to choose one specific sign for everyone.

      Since signed chars are native to most platforms and i386, it makes
      more sense to use that if we are to choose one type for everyone.

Ah.

Hope that helps,
Jonathan