Re: [PATCH] git-am: force egrep to use correct characters set

Johannes Sixt <j.sixt@xxxxxxxxxxxxx> · Mon, 28 Sep 2009 10:12:47 +0200

Jeff King wrote:
> On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
> 
>> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
>> We do not rely here on the LC_ALL value we get from the environment.
> 
> Hmm. Probably makes sense here, as it is a wide enough range that it may
> pick up other stray non-ascii characters in other charsets (though as
> the manpage notes, the likely thing is to pick up A-Z along with a-z,
> which is OK here as we encompass both in our range).
> 
> There are two other calls to egrep with brackets (both in
> git-submodule.sh), but they are just [0-7], which is presumably OK in
> just about any charset.
> 
> Do you happen to know a charset in which this is a problem, just for
> reference?

It's not so much about charsets as about languages. grep(1) says:

       Within a bracket expression, a range expression consists
       of two characters separated by a hyphen.  It matches any
       single character that sorts between the two  characters,
       inclusive,  using  the  locale's  collating sequence and
       character set.  For example, in the  default  C  locale,
       [a-d]  is equivalent to [abcd].  Many locales sort char-
       acters in dictionary order, and in these  locales  [a-d]
       is  typically  not  equivalent  to  [abcd];  it might be
       equivalent to [aBbCcDd], for  example.   To  obtain  the
       traditional  interpretation  of bracket expressions, you
       can use the C locale by setting the  LC_ALL  environment
       variable to the value C.

For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
letters, uppercase letters (!), and umlauts (!!) because in dictionary
order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must
be UTF-8, of course.)
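
A minimal sketch of the effect, assuming a POSIX shell and grep; the helper
name only_ascii_lower is made up for illustration. It pins the range to the
C locale so the result no longer depends on the caller's LC_ALL or LANG:

```shell
# Succeed only if "$1" consists entirely of bytes a-z, with the range
# evaluated in the C locale regardless of the caller's environment.
only_ascii_lower() {
    printf '%s\n' "$1" | LC_ALL=C grep -q '^[a-z][a-z]*$'
}

umlaut_word=$(printf '\303\244bc')    # "äbc" encoded as UTF-8

for word in abc ABC "$umlaut_word"
do
    if only_ascii_lower "$word"
    then
        echo "$word: matches [a-z] only"
    else
        echo "$word: contains something outside [a-z]"
    fi
done
```

With LC_ALL=C only the first word passes. Without it, under a locale such
as de_DE.UTF-8, the other two may pass as well, depending on the grep and
libc versions in use.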

Given that this applies not only to egrep, but to grep in general (and
perhaps even to other tools that support ranges, like sed), it may be
necessary to audit all range expressions.
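
As a starting point for such an audit, something like the following could
list candidates. This is only a rough sketch: list_range_candidates is a
hypothetical helper, and the pattern is a heuristic that will miss
pipelines split across lines and will flag ranges that are harmless.

```shell
# Hypothetical helper: print the lines (with line numbers) of the given
# files that mention grep or sed and contain a bracket range such as
# [a-z] or [0-7]. The search itself runs in the C locale.
list_range_candidates() {
    LC_ALL=C grep -n -E '(grep|sed).*\[[^]]*[0-9A-Za-z]-[0-9A-Za-z]' "$@"
}
```

It could be run as, for example, list_range_candidates *.sh in the top
level of the source tree; every hit still needs a look by hand.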

The case identified by Christian is certainly important because the check
is applied to a file whose contents can be anything, and its purpose is to
identify the text as an mbox file, whose header section can only be
US-ASCII by definition. So I think the patch is worth applying.
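
To illustrate the kind of check in question, here is a sketch, not the
actual git-am code. The function name looks_like_mbox_headers is
hypothetical, and the header-name pattern '^[!-9;-~]+:' (printable ASCII
except the colon, followed by a colon) is my assumption of what such a
test looks like:

```shell
# Hypothetical helper: succeed if every header line of file "$1" (up to
# the first blank line, ignoring folded continuation lines) begins with
# a field name made only of printable US-ASCII characters and a colon.
looks_like_mbox_headers() {
    # Print the header block only; drop lines starting with space or tab.
    if sed -n -e '/^$/q' -e '/^[[:blank:]]/d' -e p "$1" |
        # With LC_ALL=C the ranges !-9 and ;-~ are plain byte ranges.
        LC_ALL=C grep -E -v '^[!-9;-~]+:' >/dev/null
    then
        return 1    # at least one line does not look like a header
    fi
    return 0
}
```

Without the LC_ALL=C, a header name containing, say, an umlaut could slip
through in some locales, which is exactly what the check is meant to rule
out.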

-- Hannes
