Re: [PATCH v2 21/29] grep: factor test for \0 in grep patterns into a function

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Mon, 15 May 2017 20:07:49 +0200

On Mon, May 15, 2017 at 8:24 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Ævar Arnfjörð Bjarmason  <avarab@xxxxxxxxx> writes:
>
>> Factor the test for \0 in grep patterns into a function. Since commit
>> 9eceddeec6 ("Use kwset in grep", 2011-08-21) any pattern containing a
>> \0 is considered fixed as regcomp() can't handle it.
>>
>> This limitation was never documented, and other some regular
>> expression engines are capable of compiling a pattern containing a
>> \0. Factoring this out makes a subsequent change which does that
>> smaller.
>
> The latter half of the first sentence of this paragraph does not
> parse well, around "and other some".  "never documented" makes
> readers expect "... so let's document it", but I think you are
> driving in a different direction, something like...

I'll try to reword it.

>         Since some other regular expression engines are capable of a
>         pattern to match a string with NUL in it, it would be nice
>         if we can use such an engine and match against NUL; as this
>         "patterns with NUL always use --fixed match" is not documented,
>         let's hope that nobody depend on that current behaviour.
>         This step does not yet change the behaviour yet, but it
>         makes it easier to do so in later steps.
>
> perhaps?  I tend to agree that it is OK to break users who depended
> on that "patterns with NULL match with --fixed" (partly because it
> is so unintuitive and not useful behaviour, and partly because it is
> easy for them to explicitly say -F), but let's make sure we clearly
> say that we will be knowingly breaking them.

I think there's a few different things going on here, which I'll
address accordingly.

 * Yes, if someone is relying on the undocumented behavior of a \0
pattern implying -F if and when I change that their stuff will break.
I'm going to actually just drop this whole discussion from this
commit, we can have an explanation for that when and if I actually
change it in a later series.

* I'm making things more confusing by saying "engines" here, but it's
important to note that git has never in its docs promised to use the C
library implementation of POSIX regexes, it's just advertised that it
uses POSIX regular *expressions*, which are a different thing.
Implementation != syntax.

E.g. GNU grep uses POSIX regex syntax, but its engine does not have
the same limitation as ours:

    $ (command cd /tmp && printf "æ\0ð\n" >a && printf "æ\n" >> a;
printf "æ\0[öð]" >f; grep -a -f f a; echo $?)
    æð
    0

I went down this particular rabbit hole because I was genuinely
surprised that this didn't work with git-grep, until I realized that
of course the implementation detail of using C strings for this under
the hood was bleeding through.