On Mon, Sep 28, 2009 at 10:12:47AM +0200, Johannes Sixt wrote: > Jeff King schrieb: > > On Fri, Sep 25, 2009 at 06:43:20PM +0200, Christian Himpel wrote: > > > >> According to egrep(1) the US-ASCII table is used when LC_ALL=C is set. > >> We do not rely here on the LC_ALL value we get from the environment. > > > > Hmm. Probably makes sense here, as it is a wide enough range that it may > > pick up other stray non-ascii characters in other charsets (though as > > the manpage notes, the likely thing is to pick up A-Z along with a-z, > > which is OK here as we encompass both in our range). > > > > There are two other calls to egrep with brackets (both in > > git-submodule.sh), but they are just [0-7], which is presumably OK in > > just about any charset. > > > > Do you happen to know a charset in which this is a problem, just for > > reference? > > It's not so much about charsets than about languages: > > Within a bracket expression, a range expression consists > of two characters separated by a hyphen. It matches any > single character that sorts between the two characters, > inclusive, using the locale's collating sequence and > character set. For example, in the default C locale, > [a-d] is equivalent to [abcd]. Many locales sort char- > acters in dictionary order, and in these locales [a-d] > is typically not equivalent to [abcd]; it might be > equivalent to [aBbCcDd], for example. To obtain the > traditional interpretation of bracket expressions, you > can use the C locale by setting the LC_ALL environment > variable to the value C. > > For example, in locale de_DE.UTF-8, GNU grep '[a-z]' matches lowercase > letters, uppercase letters (!), and umlauts (!!) because in dictionary > order, 'A' and 'a' are equivalent and 'Ä' sorts after 'A'. (The input must > be UTF-8, of course.) Thanks for pointing this out. You are right. I must have read the "dictonary order" part over. > Given that this applies not only to egrep, but to grep in general (and > perhaps even to other tools that support ranges, like sed), it may be > necessary to audit all range expressions. After doing a quick: LC_ALL=C find . -name '*.sh' -exec \ egrep -Hne '(grep|awk|sed).*\[.*-.*\]' {} \; As far as I can see, range expressions are used: 1. to replace or grep hexadecimal numbers (SHA1 sums). This shouldn't be a problem, if we can assume that these numbers are never malformed. 2. to replace or grep numbers (with digits). This shouldn't be a problem, since digits should be in dictionary order in every language (?!). 3. in git-rebase--interactive.sh:742 to grep for a previously generated string. So it should be safe here. > The case identified by Christian is certainly important because it is > applied to a file whose contents can be anything, and the purpose of the > check is to identify the text as an mbox file, whose header section can be > only US-ASCII by definition. So, I think it has merit to apply the patch. Yes. It seems that this is the only place where it is important to match just the ASCII printable characters. Regards, chressie -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html