Re: [PATCH v3 00/10] grep: move from kwset to optional PCRE v2

On Mon, Jul 01 2019, Junio C Hamano wrote:

> Ævar Arnfjörð Bjarmason  <avarab@xxxxxxxxx> writes:
>
>> This v3 has a new patch (3/10) that I believe fixes the regression on
>> MinGW Johannes noted in
>> https://public-inbox.org/git/nycvar.QRO.7.76.6.1907011515150.44@xxxxxxxxxxxxxxxxx/
>>
>> As noted in the updated commit message in 10/10 I believe just
>> skipping this test & documenting this in a commit message is the least
>> amount of suck for now. It's really an existing issue with us doing
>> nothing sensible when the log/grep haystack encoding doesn't match the
>> needle encoding supplied via the command line.
>
> Is that quite the case?  If they do not match, not finding the match
> is the right answer, because we are byte-for-byte matching/searching
> IIUC.
>
>> We swept that under the carpet with the kwset backend, but PCRE v2
>> exposes it.
>
> Is it exposing, or just showing the limitation of the rewritten
> implementation where it cannot do byte-for-byte matching/searching
> as we used to be able to?
>
> Without having a way to know what encoding is used on the command
> line, there is no sensible way to reencode them to match the
> haystack encoding (even when it is known), so "you got to feed the
> strings in the same encoding, as we are going to match/search
> byte-for-byte" is the only sensible way to work, given the design
> space, I would think.
>
> Not that it is all that useful to be able to match/search
> byte-for-byte, of course, so I am OK if we punt with these tests,
> but I'd prefer to see us admit we are punting when we do ;-).

I'm guilty as charged in punting this larger encoding issue. As it
pertains to this patch series it unearths an obscure case I think nobody
cares about in practice, and I'd like to move on with the "remove kwset"
optimization.

But I strongly believe that the new behavior with the PCRE v2
optimization is the only sane thing to do, and to the extent we have
anything left to do (#leftoverbits) it's that we should modify git more
generally (aside from string searching) to do the same thing where
appropriate.

Remember, this only happens if the user has set a UTF-8 locale and thus
promised that they're going to give us UTF-8. We then take that promise
and make e.g. "æ" match "Æ" under --ignore-case.

Just falling back on raw byte matching isn't going to cut it, because
then "æ<invalid utf8>" won't match "Æ<same invalid utf8>" under
--ignore-case, and there are other cases like that with matching word
boundaries & other Unicode gotchas.
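
To illustrate the "æ" vs. "Æ" point, here is a rough sketch in Python. It
is not git's C code and it does not use PCRE; it only relies on the fact
that Python's `re` module draws a similar line: a bytes pattern folds case
for ASCII letters only, while a str pattern folds case for Unicode letters.
The helper names are made up for this example.

```python
import re

def bytes_icase_match(needle: bytes, haystack: bytes) -> bool:
    """Case-insensitive search on raw bytes (only ASCII letters are folded)."""
    return re.search(re.escape(needle), haystack, re.IGNORECASE) is not None

def utf8_icase_match(needle: bytes, haystack: bytes) -> bool:
    """Decode both sides as UTF-8 first, then search with Unicode case folding."""
    pattern = re.escape(needle.decode("utf-8"))
    return re.search(pattern, haystack.decode("utf-8"), re.IGNORECASE) is not None

needle = "æ".encode("utf-8")    # b'\xc3\xa6'
haystack = "Æ".encode("utf-8")  # b'\xc3\x86'

print(bytes_icase_match(needle, haystack))  # byte-for-byte view of the data
print(utf8_icase_match(needle, haystack))   # UTF-8-aware view of the data
```

The two encodings differ in their second byte, so the byte-level search
has nothing to fold and does not find a match; the UTF-8-aware search
does.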

The best that can be hoped for at that point is some "loose UTF-8"
mode. I see both perl & GNU grep seem to support that (although I'm sure
it falls apart at some point). GNU grep will also die in the same way
that we now die with --perl-regexp (since it also uses PCRE).

I think that's saner: if the user thinks they're feeding us UTF-8 but
they're not, I think they'd like to know rather than having the string
matching library fall back on raw byte matching.
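
As a loose analogy for the two options (die on invalid UTF-8 vs. some
"loose UTF-8" mode), here is a small Python sketch. It says nothing about
how PCRE, perl or GNU grep implement this; it uses Python's strict UTF-8
decoding as a stand-in for "die", and its `surrogateescape` error handler
as a stand-in for a loose mode. The helper names are hypothetical.

```python
import re

# "æ" followed by 0xff, a byte that can never appear in valid UTF-8.
bad_needle = b"\xc3\xa6\xff"
# "Æ" followed by the same invalid byte.
bad_haystack = b"\xc3\x86\xff"

def strict_decode(data: bytes) -> str:
    """Refuse invalid input: raises UnicodeDecodeError on bad UTF-8."""
    return data.decode("utf-8")

def loose_decode(data: bytes) -> str:
    """Carry invalid bytes through as lone surrogates instead of failing."""
    return data.decode("utf-8", errors="surrogateescape")

def loose_icase_match(needle: bytes, haystack: bytes) -> bool:
    """Unicode case-insensitive search that tolerates invalid bytes."""
    pattern = re.escape(loose_decode(needle))
    return re.search(pattern, loose_decode(haystack), re.IGNORECASE) is not None

try:
    strict_decode(bad_needle)
except UnicodeDecodeError as err:
    print("strict mode refuses the input:", err.reason)

print(loose_icase_match(bad_needle, bad_haystack))
```

The strict path tells the user that their input is not what they promised;
the loose path keeps going and still treats the valid part as UTF-8.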



