Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns

On 07.07.2015 at 02:02, Duy Nguyen <pclouds@xxxxxxxxx> wrote:
> On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe <l.s.r@xxxxxx> wrote: 
> > On 06.07.2015 at 14:42, Nguyễn Thái Ngọc Duy wrote:

> > So the optimization before this patch was that if a string was searched for 
> > without -F then it would be treated as a fixed string anyway unless it 
> > contained regex special characters. Searching for fixed strings using the 
> > kwset functions is faster than using regcomp and regexec, which makes the 
> > exercise worthwhile. 
> > 
> > Your patch disables the optimization if non-ASCII characters are searched 
> > for because kwset handles case transformations only for ASCII chars. 
> > 
> > Another consequence of this limitation is that -Fi (explicit 
> > case-insensitive fixed-string search) doesn't work properly with non-ASCII 
> > chars either. How can we handle this one? Fall back to regcomp by
> > escaping all special characters? Or at least warn? 
> 
> Hehe.. I noticed it too shortly after sending the patch. I was torn 
> between simply documenting the limitation and waiting for the next 
> person to come and fix it, or quoting the regex then passing to 
> regcomp. GNU grep does the quoting in this case, but that code is 
> GPLv3 so we can't simply copy over. It could be a problem if we need 
> to quote a regex in a multibyte charset where ASCII is not a subset.
> But I guess we can just go with UTF-8.
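
To make the decision discussed above concrete, here is a rough sketch of how the choice between the fixed-string matcher and regcomp could look once the regcomp fallback is in place. This is not git's code: choose_matcher(), has_regex_special(), has_non_ascii() and the enum are hypothetical names, and the set of special characters is only an assumption about what counts as "regex special".

```c
#include <string.h>

/* Hypothetical names, used for illustration only. */
enum matcher { MATCH_KWSET, MATCH_REGEX };

/* Assumed set of regex special characters; git's own table may differ. */
static int has_regex_special(const char *p)
{
	return strpbrk(p, "$()*+.?[\\^{|") != NULL;
}

/* In UTF-8, every byte of a non-ASCII character has the high bit set. */
static int has_non_ascii(const char *p)
{
	for (; *p; p++)
		if ((unsigned char)*p & 0x80)
			return 1;
	return 0;
}

/*
 * fixed: the user passed -F.  icase: the user passed -i.
 * The fixed-string matcher only folds case for ASCII, so a
 * case-insensitive search for a non-ASCII pattern goes to regcomp
 * (with the pattern escaped first if -F was given).
 */
static enum matcher choose_matcher(const char *pattern, int fixed, int icase)
{
	if (icase && has_non_ascii(pattern))
		return MATCH_REGEX;
	if (fixed || !has_regex_special(pattern))
		return MATCH_KWSET;
	return MATCH_REGEX;
}
```

With this ordering, -Fi with a non-ASCII pattern ends up on the regcomp path, which is the case that needs the pattern to be quoted first.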

I played with the code a little and came up with this function to escape
regular expressions in UTF-8. Hope it helps.

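/*
 * Copy pattern into a newly allocated buffer, putting a backslash in front
 * of every single-byte character that is_regex_special() flags. Multibyte
 * UTF-8 characters are copied unchanged. The result is returned through
 * new_pattern and new_len; it is not NUL-terminated.
 */
static void escape_regexp(const char *pattern, size_t len,
                char **new_pattern, size_t *new_len)
{
        const char *p = pattern;
        /* Worst case: every byte gets a backslash in front of it. */
        char *np = *new_pattern = xmalloc(2 * len);
        int chrlen;

        while (len) {
                /* Advances p past one character and shrinks len. */
                chrlen = mbs_chrlen(&p, &len, "utf-8");
                if (chrlen == 1 && is_regex_special(*pattern))
                        *np++ = '\\';

                memcpy(np, pattern, chrlen);
                np += chrlen;
                pattern = p;
        }

        *new_len = np - *new_pattern;
}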
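
For anyone who wants to try the idea outside of git, here is a minimal standalone sketch. It is not the patch: quote_regex(), is_special() and match_literal_icase() are made-up names, and is_special() is only a stand-in for git's is_regex_special(). The pattern is compiled with REG_EXTENDED, so the escaped set is the POSIX ERE one; with basic regular expressions a backslash in front of "(" or "{" would make the character special instead of literal. The sketch scans bytes rather than characters, which is safe for UTF-8 only, because every byte of a multibyte UTF-8 sequence has the high bit set and can never look like an ASCII special character.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for git's is_regex_special(): characters special in POSIX ERE. */
static int is_special(unsigned char c)
{
	return c != '\0' && strchr("^.[$()|*+?{\\", c) != NULL;
}

/*
 * Return a newly allocated, NUL-terminated copy of pattern in which
 * every special character is preceded by a backslash, or NULL if the
 * allocation fails. The caller frees the result.
 */
static char *quote_regex(const char *pattern)
{
	char *out, *np;

	assert(pattern != NULL);
	/* Worst case: every byte is escaped, plus the terminating NUL. */
	out = np = malloc(2 * strlen(pattern) + 1);
	if (!out)
		return NULL;
	for (; *pattern; pattern++) {
		if (is_special((unsigned char)*pattern))
			*np++ = '\\';
		*np++ = *pattern;
	}
	*np = '\0';
	return out;
}

/*
 * Search text for pattern taken literally, ignoring case.
 * Returns 1 on a match, 0 on no match, -1 on error.
 */
static int match_literal_icase(const char *pattern, const char *text)
{
	regex_t re;
	char *quoted = quote_regex(pattern);
	int ret;

	if (!quoted)
		return -1;
	ret = regcomp(&re, quoted, REG_EXTENDED | REG_ICASE | REG_NOSUB);
	free(quoted);
	if (ret)
		return -1;
	ret = regexec(&re, text, 0, NULL, 0) == 0;
	regfree(&re);
	return ret;
}
```

Calling match_literal_icase("1+1", "11") reports no match, because the "+" is escaped and no longer means "one or more". Whether REG_ICASE also folds non-ASCII letters depends on the C library and on the locale selected with setlocale(), so that part needs checking on the target platform.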
