Re: [PATCH] git-am: force egrep to use correct characters set

Christian Himpel <chressie@xxxxxxxxxxxxxx> · Mon, 28 Sep 2009 11:32:16 +0200

On Mon, Sep 28, 2009 at 10:12:47AM +0200, Johannes Sixt wrote:
> Jeff King schrieb:
> > On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
> > 
> >> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
> >> We do not rely here on the LC_ALL value we get from the environment.
> > 
> > Hmm. Probably makes sense here, as it is a wide enough range that it may
> > pick up other stray non-ascii characters in other charsets (though as
> > the manpage notes, the likely thing is to pick up A-Z along with a-z,
> > which is OK here as we encompass both in our range).
> > 
> > There are two other calls to egrep with brackets (both in
> > git-submodule.sh), but they are just [0-7], which is presumably OK in
> > just about any charset.
> > 
> > Do you happen to know a charset in which this is a problem, just for
> > reference?
> 
> It's not so much about charsets than about languages:
> 
>        Within a bracket expression, a range expression consists
>        of two characters separated by a hyphen.  It matches any
>        single character that sorts between the two  characters,
>        inclusive,  using  the  locale's  collating sequence and
>        character set.  For example, in the  default  C  locale,
>        [a-d]  is equivalent to [abcd].  Many locales sort char-
>        acters in dictionary order, and in these  locales  [a-d]
>        is  typically  not  equivalent  to  [abcd];  it might be
>        equivalent to [aBbCcDd], for  example.   To  obtain  the
>        traditional  interpretation  of bracket expressions, you
>        can use the C locale by setting the  LC_ALL  environment
>        variable to the value C.
> 
> For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
> letters, uppercase letters (!), and umlauts (!!) because in dictionary
> order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must
> be UTF-8, of course.)

Thanks for pointing this out.  You are right.  I must have read the
"dictonary order" part over.

> Given that this applies not only to egrep, but to grep in general (and
> perhaps even to other tools that support ranges, like sed), it may be
> necessary to audit all range expressions.

After doing a quick:

LC_ALL=C find . -name '*.sh' -exec \
         egrep -Hne '(grep|awk|sed).*\[.*-.*\]' {} \;

As far as I can see, range expressions are used:

1. to replace or grep hexadecimal numbers (SHA1 sums).  This shouldn't
be a problem, if we can assume that these numbers are never malformed.

2. to replace or grep numbers (with digits).  This shouldn't be a
problem, since digits should be in dictionary order in every language
(?!).

3. in git-rebase--interactive.sh:742 to grep for a previously generated
string.  So it should be safe here.

> The case identified by Christian is certainly important because it is
> applied to a file whose contents can be anything, and the purpose of the
> check is to identify the text as an mbox file, whose header section can be
> only US-ASCII by definition. So, I think it has merit to apply the patch.

Yes.  It seems that this is the only place where it is important to match
just the ASCII printable characters.

Regards,
chressie
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html