Re: [BUG] Improper 8-bit parsing because of signed overflow

Hi Max,

max ulidtko wrote:

> $ cat < тест
> sh: cannot open ÑÐïÑ: No such file

With Debian dash 0.5.5.1-7.4:

 $ dash -c 'cat < тест' 2>&1 |
	LC_ALL=C sed -e 's/dash: cannot open \(.*\):.*/\1/' |
	xxd
 0000000: d182 d0b5 d1d1 820a                      ........
 $ dash -c 'echo тест' | xxd
 0000000: d182 d0b5 d181 d182 0a                   .........
 
The \x81 is being swallowed up.  This is <http://bugs.debian.org/532302>,
fixed by f8231a ([EXPAND] Fix corruption of redirections with byte 0x81,
2010-05-27).

Your question is still worth investigating for its own sake, though,
so let's move on to that.

> The reason is signed overflow. The parser uses syntax tables to
> determine the class to which a given byte (assuming it's a whole
> character) belongs. The lookup is done like this:
> 	switch(syntax[c]) {

Given confusing code, it is often helpful to learn what the authors
were thinking when it was written:

 $ git log -S'switch(syntax[c])' -- src/parser.c
 commit 05c1076ba2d1a68fe7f3a5ae618f786b8898d327
 Author: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
 Date:   Mon Sep 26 18:32:28 2005 +1000

     Initial import.

Well, so much for that.  But does that mean the signed lookup has
been present for five years?  Looking at 05c107:src/parser.c, one is
led to wonder how c gets set in the first place.

	c = pgetc();

What did pgetc do?

	int
	pgetc(void)
	{
		return pgetc_macro();
	}

And pgetc_macro?

	extern char *parsenextc;		/* next character in input buffer */
[...]
	#define pgetc_macro()	(--parsenleft >= 0? *parsenextc++ : preadbuffer())
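
Note that parsenextc is a plain char *, so the int this macro yields
for a byte above 0x7f depends on whether char is signed.  A minimal
sketch of the difference, assuming 8-bit bytes and two's complement;
read_plain, read_unsigned and as_signed8 are hypothetical helpers for
illustration, not dash functions:

```c
#include <assert.h>
#include <limits.h>

/* Read one byte through a plain char pointer, as *parsenextc++ does.
 * The result is sign-extended when char is signed. */
static int read_plain(const char *p)
{
	return *p;
}

/* Read the same byte as unsigned char: always in the range 0..255. */
static int read_unsigned(const char *p)
{
	return (unsigned char)*p;
}

/* Map a byte value 0..255 to the value a signed 8-bit char would hold. */
static int as_signed8(int byte)
{
	assert(byte >= 0 && byte <= 255);
	return byte >= 128 ? byte - 256 : byte;
}
```

For the byte 0x81, read_unsigned() gives 129 everywhere, while
read_plain() gives -127 where char is signed and 129 where it is not.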

Sounds unportable --- the signedness depends on the platform.  Okay,
so what does syntax[-1] give?  05c107:src/mksyntax.c has some hints:

	if (sign)
		base += 1 << (nbits - 1);

So syntax points into the _middle_ of the generated table when char is
signed, which makes negative indices such as syntax[-1] land inside the
table rather than before it.
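
The offset trick can be sketched roughly as follows.  This is not the
code mksyntax emits: the demo_* names and the two classes are made up,
and the offset of 128 assumes nbits is 8 and that base starts at zero,
which the excerpt above does not show.

```c
#include <assert.h>

/* Hypothetical character classes; dash's real ones differ. */
enum { DEMO_PLAIN, DEMO_SPECIAL };

#define DEMO_SYNBASE 128	/* 1 << (8 - 1), per the assumption above */

/* Backing storage covers indices -128..127 once the base is applied. */
static const unsigned char demo_table[256] = {
	[DEMO_SYNBASE - 127] = DEMO_SPECIAL,	/* slot for index -127 */
	[DEMO_SYNBASE + 'a'] = DEMO_PLAIN,	/* slot for index 'a' */
};

/* Pointer into the middle of the table, so negative indices are valid. */
static const unsigned char *const demo_syntax = demo_table + DEMO_SYNBASE;

/* Look up the class for a (possibly negative) signed char value. */
static int demo_class(int c)
{
	assert(c >= -DEMO_SYNBASE && c < 256 - DEMO_SYNBASE);
	return demo_syntax[c];
}
```

With this layout demo_class(-127) reads demo_table[1], so a
sign-extended 0x81 is an ordinary in-bounds lookup.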

That code isn't present in current src/mksyntax.c.  What gives, one
might wonder?

 $ git log -1 -S'base +=' -- src/mksyntax.c
 commit d8014392bc291504997c65b3b44a7f21a60b0e07
 Author: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
 Date:   Sun Apr 23 16:01:05 2006 +1000

     [PARSER] Only use signed char for syntax arrays

     The existing scheme of using the native char for syntax array indicies
     makes cross-compiling difficult.  Therefore it makes sense to choose
     one specific sign for everyone.

     Since signed chars are native to most platforms and i386, it makes more
     sense to use that if we are to choose one type for everyone.

Ah.

Hope that helps,
Jonathan