Re: [PATCH v2 2/4] diff.c: implement a sanity check for word regexes

Thomas Rast <trast@xxxxxxxxxxxxxxx> · Sun, 19 Dec 2010 02:59:44 +0100

Junio C Hamano wrote:
> Thomas Rast <trast@xxxxxxxxxxxxxxx> writes:
> 
> > * The word regex matches anything that is !isspace().
> >
> > * The word regex does not match '\n'.  (This case is not very harmful,
> >   but we used to silently cut off at the '\n' which may go against
> >   user expectations.)
> 
> How expensive to run this check twice, every time word_regex finds a
> match?

It runs the check from the first bullet point on every stretch of
input the regex skips over, and the check from the second bullet
point on every match.  So the check itself looks at every input
character exactly once.
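
To make the cost concrete, here is a minimal standalone sketch of
such a payload-dependent check.  It is not the code from the patch:
the names check_word_regex(), has_nonspace() and enum word_check are
made up for illustration, it works on a NUL-terminated string rather
than a buffer with a length, and it treats an empty match like "no
further match".  It assumes the regex was compiled without REG_NOSUB.

```c
#include <ctype.h>
#include <regex.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not taken from the patch. */
enum word_check { WORD_OK = 0, WORD_SKIPS_NONSPACE, WORD_SPANS_NEWLINE };

/* Does [p, end) contain any character that is !isspace()? */
static int has_nonspace(const char *p, const char *end)
{
	for (; p < end; p++)
		if (!isspace((unsigned char)*p))
			return 1;
	return 0;
}

/*
 * Walk over buf the way a word splitter would.  Text skipped between
 * matches is scanned for non-space characters (first bullet), and each
 * match is scanned for '\n' (second bullet), so every character is
 * inspected once by the check.
 */
static enum word_check check_word_regex(const regex_t *re, const char *buf)
{
	const char *p = buf, *end = buf + strlen(buf);
	regmatch_t m;

	while (p < end) {
		const char *mstart, *mend;

		/* No further non-empty match: the rest is unmatched text. */
		if (regexec(re, p, 1, &m, 0) != 0 || m.rm_so == m.rm_eo)
			return has_nonspace(p, end) ?
				WORD_SKIPS_NONSPACE : WORD_OK;

		mstart = p + m.rm_so;
		mend = p + m.rm_eo;
		if (has_nonspace(p, mstart))
			return WORD_SKIPS_NONSPACE;
		if (memchr(mstart, '\n', mend - mstart))
			return WORD_SPANS_NEWLINE;
		p = mend;
	}
	return WORD_OK;
}
```

With an extended regex such as "[a-z]+" the input "foo+bar" trips the
first check (the '+' is skipped), while "[^ ]+" on "foo\nbar" trips
the second, because without REG_NEWLINE a non-matching list also
matches the newline.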

> As this is about making sure that we got a sane regex from the user (or a
> builtin pattern), I wonder if we can make it not depend on the payload we
> are matching the regex against.  Then before using a word_regex that we
> have not checked, we check if that regex is sane, mark it checked, and do
> not have to do the check over and over again.

Algorithmically it should be easy once you have the finite state
automaton corresponding to the regex: just verify that for every
possible non-terminal state, there is a transition for every
!isspace() character to a state other than "fail to match" or "match
the empty string".
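
As a rough sketch of that verification, on a toy table-driven
automaton rather than on whatever regcomp() builds internally:
struct toy_dfa, TOY_FAIL and all the helpers below are hypothetical,
and for simplicity the "match the empty string" outcome is folded
into the same TOY_FAIL sentinel as "fail to match".

```c
#include <ctype.h>

#define TOY_FAIL (-1)	/* hypothetical: "fail" or "match the empty string" */

/* Toy automaton for illustration; unrelated to compat/regex internals. */
struct toy_dfa {
	int nstates;
	const unsigned char *terminal;	/* terminal[s] != 0: accepting state */
	int (*next)[256];		/* next[s][c]: target state or TOY_FAIL */
};

static int toy_is_nonspace(int c) { return !isspace(c); }
static int toy_is_lower(int c) { return c >= 'a' && c <= 'z'; }

/* Fill the row of state s: bytes accepted by keep() go to target. */
static void toy_set_row(struct toy_dfa *d, int s, int (*keep)(int), int target)
{
	int c;

	for (c = 0; c < 256; c++)
		d->next[s][c] = keep(c) ? target : TOY_FAIL;
}

/*
 * Return 1 if every non-terminal state has a usable transition for
 * every !isspace() byte, 0 otherwise.  Byte 0 is skipped because it
 * cannot occur inside a C string.
 */
static int toy_dfa_covers_nonspace(const struct toy_dfa *d)
{
	int s, c;

	for (s = 0; s < d->nstates; s++) {
		if (d->terminal[s])
			continue;
		for (c = 1; c < 256; c++)
			if (!isspace(c) && d->next[s][c] == TOY_FAIL)
				return 0;
	}
	return 1;
}
```

A two-state table modelling "[^[:space:]]+" passes this check; one
modelling "[a-z]+" does not, since its start state has no transition
for '+'.  The check depends only on the table, not on any payload, so
it would need to run just once per regex.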

In the implementation, it might be doable if we switch to compat/regex
on all platforms, since we then have ready access to all internal
structures regcomp() creates, including the DFA.

I'll think about at least using compat/regex for a static check of all
*builtin* patterns, which would be superior to the brute force
approach in 4/4.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch