On Thu, Nov 18, 2021 at 12:42 AM Hamza Mahfooz <someguy@xxxxxxxxxxxxxxxxxxx> wrote:
>
> UTF mode is enabled for cases that cause older versions of PCRE to break.

Not really; what is broken is our implementation of how PCRE gets called.
It ignores the fact that giving PCRE invalid UTF-8 (which might be valid
LATIN-1 text, for example) and telling it to match in UTF mode will fail
(with an error, if we are lucky), or might even crash (and obviously not
match) if we also tell it to skip the validation, which is something we do
when JIT is enabled.

> This is primarily due to the fact that we can't make as many assumptions on
> the kind of data that is fed to "git grep." So, limit when UTF mode can be
> enabled by introducing "is_log" to struct grep_opt, checking to see if it's
> a non-zero value in compile_pcre2_pattern() and only mutating it in
> cmd_log() so that we know "git log" was invoked if it's set to a non-zero
> value.

I haven't tested it, but I think that for this to work with the log, we
also need to make sure that all log entries that might not be UTF-8 go
through iconv() first, which is probably why Ævar mentioned[1]
i18n.commitEncoding in his old email.

Of course, doing that translation only makes sense if the log output is
meant to be UTF-8, which is why there is all that logic about being in a
UTF-8 locale or not, and that logic probably needs to be adjusted as well.

Carlo

[1] https://lore.kernel.org/git/87v92bju64.fsf@xxxxxxxxxxxxxxxxxxx/
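
PS: to make the LATIN-1 point concrete, here is a minimal sketch in Python
(not git's C code, and untested against git itself). The helper names
is_valid_utf8() and to_utf8() are made up for illustration; to_utf8() only
stands in for the kind of re-encoding that iconv() would do before the text
is handed to a matcher running in UTF mode.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from source_encoding to UTF-8 (hypothetical helper)."""
    return data.decode(source_encoding).encode("utf-8")


# A commit subject stored as LATIN-1: perfectly valid text, but not valid UTF-8.
latin1_subject = "café".encode("latin-1")
assert not is_valid_utf8(latin1_subject)

# After re-encoding, the same text is safe to treat as UTF-8.
converted = to_utf8(latin1_subject, "latin-1")
assert is_valid_utf8(converted)
```

The same bytes are fine as LATIN-1 and broken as UTF-8, so whether UTF mode
is safe depends on the data, not only on which command was invoked.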