Anthony Ramine <n.oxyde@xxxxxxxxx> writes: > ase folding is not done correctly when matching against the [:upper:] > character class and uppercased character ranges (e.g. A-Z). > Specifically, an uppercase letter fails to match against any of them > when case folding is requested because plain characters in the pattern > and the whole string are preemptively lowercased to handle the base case > fast. > > That optimization is kept and ISLOWER() is used in the [:upper:] case > when case folding is requested, while matching against a character range > is retried with toupper() if the character was lowercase, as the bounds > of the range itself cannot be modified (in a case-insensitive context, > [A-_] is not equivalent to [a-_]). > > Signed-off-by: Anthony Ramine <n.oxyde@xxxxxxxxx> > Reviewed-by: Duy Nguyen <pclouds@xxxxxxxxx> > --- Thanks. > t/t3070-wildmatch.sh | 55 ++++++++++++++++++++++++++++++++++++++++++++++------ > wildmatch.c | 7 +++++++ > 2 files changed, 56 insertions(+), 6 deletions(-) > > I added Duy as reviewer and fixed a typo in the commit message reported by > Eric Sunshine. > > diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh > index 4c37057..38446a0 100755 > --- a/t/t3070-wildmatch.sh > +++ b/t/t3070-wildmatch.sh > @@ -6,20 +6,20 @@ test_description='wildmatch tests' > > match() { > if [ $1 = 1 ]; then > - test_expect_success "wildmatch: match '$3' '$4'" " > + test_expect_success "wildmatch: match '$3' '$4'" " > test-wildmatch wildmatch '$3' '$4' > " > else > - test_expect_success "wildmatch: no match '$3' '$4'" " > + test_expect_success "wildmatch: no match '$3' '$4'" " > ! test-wildmatch wildmatch '$3' '$4' > " > fi > if [ $2 = 1 ]; then > - test_expect_success "fnmatch: match '$3' '$4'" " > + test_expect_success "fnmatch: match '$3' '$4'" " > test-wildmatch fnmatch '$3' '$4' > " > elif [ $2 = 0 ]; then > - test_expect_success "fnmatch: no match '$3' '$4'" " > + test_expect_success "fnmatch: no match '$3' '$4'" " > ! test-wildmatch fnmatch '$3' '$4' > " > # else Is the above about aligning $3/$4 across checks of different types (i.e. purely cosmetic)? I am not complaining; just making sure if there is nothing deeper going on. It is outside the scope of this change, but the shell style of this script (most notably use of [] instead of test) needs to be fixed someday, preferrably soon, including the commented out else clause at the end of the hunk. > @@ -235,4 +247,35 @@ pathmatch 1 abcXdefXghi '*X*i' > pathmatch 1 ab/cXd/efXg/hi '*/*X*/*/*i' > pathmatch 1 ab/cXd/efXg/hi '*Xg*i' > > +# Case-sensitivy features > +match 0 x 'a' '[A-Z]' > +match 1 x 'A' '[A-Z]' > +match 0 x 'A' '[a-z]' > +match 1 x 'a' '[a-z]' > +match 0 x 'a' '[[:upper:]]' > +match 1 x 'A' '[[:upper:]]' > +match 0 x 'A' '[[:lower:]]' > +match 1 x 'a' '[[:lower:]]' > +match 0 x 'A' '[B-Za]' > +match 1 x 'a' '[B-Za]' > +match 0 x 'A' '[B-a]' > +match 1 x 'a' '[B-a]' > +match 0 x 'z' '[Z-y]' > +match 1 x 'Z' '[Z-y]' > + > +imatch 1 'a' '[A-Z]' Do we want "# Case-insensitivity features" commment here as well? > +imatch 1 'A' '[A-Z]' > +imatch 1 'A' '[a-z]' > +imatch 1 'a' '[a-z]' > +imatch 1 'a' '[[:upper:]]' > +imatch 1 'A' '[[:upper:]]' > +imatch 1 'A' '[[:lower:]]' > +imatch 1 'a' '[[:lower:]]' > +imatch 1 'A' '[B-Za]' > +imatch 1 'a' '[B-Za]' > +imatch 1 'A' '[B-a]' > +imatch 1 'a' '[B-a]' > +imatch 1 'z' '[Z-y]' > +imatch 1 'Z' '[Z-y]' > + > test_done > diff --git a/wildmatch.c b/wildmatch.c > index 7192bdc..f91ba99 100644 > --- a/wildmatch.c > +++ b/wildmatch.c > @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags) > } > if (t_ch <= p_ch && t_ch >= prev_ch) > matched = 1; > + else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) { > + uchar t_ch_upper = toupper(t_ch); > + if (t_ch_upper <= p_ch && t_ch_upper >= prev_ch) > + matched = 1; > + } > p_ch = 0; /* This makes "prev_ch" get set to 0. */ Hmm, this looks somewhat strange. * At the beginning of the outermost "per characters in the text" loop, we seem to downcase t_ch when WM_CASEFOLD is set. * Also at the same place, we also seem to downcase p_ch under the same condition. which makes me wonder why the fix is not like this: + if (flags & WM_CASEFOLD) { + if (ISUPPER(p_ch)) + p_ch = tolower(p_ch); + if (prev_ch && ISUPPER(prev_ch)) + prev_ch = tolower(prev_ch); + } if (t_ch <= p_ch && t_ch >= prev_ch) matched = 1; p_ch = 0; /* This sets "prev_ch" to 0 */ Ahh, OK, the "seemingly strange" construct is about handling a range like "[Z-y]"; we do not want to upcase or downcase the p_ch/prev_ch make the range "[z-y]" (empty) which would exclude bytes like "^", "_" or even "Z". And it is also OK to downcase p_ch in a single-letter case, not the characters in a range, at the beginning of the outermost loop, because we always compare for equality against t_ch (which is downcased) in that case. > @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *text, unsigned int flags) > } else if (CC_EQ(s,i, "upper")) { > if (ISUPPER(t_ch)) > matched = 1; > + else if ((flags & WM_CASEFOLD) && ISLOWER(t_ch)) > + matched = 1; This also looks somewhat strange but correct in that t_ch is already downcased so we do not need a corresponding change for CC_EQ("lower") codepath. Interesting. Will apply. Thanks. -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html