Jeff King wrote:
> On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote:
>
>> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set.
>> We do not rely here on the LC_ALL value we get from the environment.
>
> Hmm. Probably makes sense here, as it is a wide enough range that it may
> pick up other stray non-ascii characters in other charsets (though as
> the manpage notes, the likely thing is to pick up A-Z along with a-z,
> which is OK here as we encompass both in our range).
>
> There are two other calls to egrep with brackets (both in
> git-submodule.sh), but they are just [0-7], which is presumably OK in
> just about any charset.
>
> Do you happen to know a charset in which this is a problem, just for
> reference?

It's not so much about charsets as about languages:

    Within a bracket expression, a range expression consists of two
    characters separated by a hyphen. It matches any single character
    that sorts between the two characters, inclusive, using the
    locale's collating sequence and character set. For example, in the
    default C locale, [a-d] is equivalent to [abcd]. Many locales sort
    characters in dictionary order, and in these locales [a-d] is
    typically not equivalent to [abcd]; it might be equivalent to
    [aBbCcDd], for example. To obtain the traditional interpretation
    of bracket expressions, you can use the C locale by setting the
    LC_ALL environment variable to the value C.

For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase
letters, uppercase letters (!), and umlauts (!!) because in dictionary
order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input
must be UTF-8, of course.)

Given that this applies not only to egrep, but to grep in general (and
perhaps even to other tools that support ranges, like sed), it may be
necessary to audit all range expressions.

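To illustrate the remedy the man page describes, here is a rough sketch
(not the actual patch) that pins the locale for a single grep call by
prefixing it with LC_ALL=C. The helper name only_ascii_lower and the
sample inputs are made up for illustration; the non-ASCII input is
written with octal escapes so that the script itself stays US-ASCII.

```shell
#!/bin/sh
# Hypothetical helper, not taken from git: succeeds only if every line
# on stdin consists solely of the bytes 'a' through 'z'.
only_ascii_lower() {
	# LC_ALL=C applies to this grep call only, so the range is compared
	# by byte value. grep -v selects the lines that do NOT match, so an
	# exit status of 0 means at least one offending line exists.
	if LC_ALL=C grep -v '^[a-z]*$' >/dev/null
	then
		return 1
	fi
	return 0
}

printf 'abc\n' | only_ascii_lower && echo "abc: accepted"
printf 'ABC\n' | only_ascii_lower || echo "ABC: rejected"
# \303\244 is the UTF-8 encoding of a-umlaut.
printf '\303\244\n' | only_ascii_lower || echo "a-umlaut: rejected"
```

With LC_ALL=C in place, only the first input is accepted. Without the
prefix, the other two results may depend on the locale of whoever runs
the script, which is the problem described above.
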
The case identified by Christian is certainly important because it is
applied to a file whose contents can be anything, and the purpose of
the check is to identify the text as an mbox file, whose header section
can be only US-ASCII by definition. So, I think it has merit to apply
the patch.

-- Hannes