On Fri, Oct 08 2021, Jeff King wrote: > On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote: > >> If that were coupled with, say, an advise() call to explain that output >> and matching might be inaccurate (and show that _once_), that might >> might it more clear what's going on. >> >> Now I am sympathetic to flooding the user with too many messages, and >> maybe reducing this to a single instance of "some commit messages could >> not be re-encoded; output and matching might be inaccurate" is the right >> thing. But in a sense, it's also working as designed: what you asked for >> is producing wrong output over and over, and Git is saying so. > > The single-output version would perhaps be something like this: > > diff --git a/pretty.c b/pretty.c > index 708b618cfe..c86f41bae7 100644 > --- a/pretty.c > +++ b/pretty.c > @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding) > return strbuf_detach(&tmp, NULL); > } > > +static void show_encoding_warning(const char *output_encoding) > +{ > + static int seen_warning; > + > + if (seen_warning) > + return; > + > + seen_warning = 1; > + warning("one or more commits could not be re-encoded to '%s'", > + output_encoding); > + advise("When re-encoding fails, some output may be in an unexpected\n" > + "encoding, and pattern matches against commit data may be\n" > + "inaccurate."); > +} > + > const char *repo_logmsg_reencode(struct repository *r, > const struct commit *commit, > char **commit_encoding, > @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r, > * case we just return the commit message verbatim. > */ > if (!out) { > - warning("unable to reencode commit to '%s'", output_encoding); > + show_encoding_warning(output_encoding); > return msg; > } > return out; I'm not categorically opposed to having this warning stay, because I can imagine an implementation of it that isn't so overmatching, but I think neither one of us is going to work on that, so .... We have the exact same edge case in the grep-a-file code, and it's a delibirate decision to stay quiet about it. Let's assume your pattern contains non-ASCII, you've asked for locale-aware grepping, you want \d+ to mean all sorts of Unicode fuzzyness instead of just [0-9] etc. (not yet implemented, PCRE_UCP). It would still be annoying to see a warning every time you grep without providing a pathspec that blacklists say the 100 '*.png' files that are in your tree. And that's a case where we *could* say that the user should mark them with .gitattributes or whatever, but making every git user go through that would be annoying to them, so we just do our best and silently fall back. Similarly with this, let's say I'm on an OS that likes UTF-16 better, as some of our users do, I have the relevant settings to re-encode git.git or linux.git. Now run: git -c i18n.logOutputEncoding=utf16 log --author=foobarbaz And it's 2 warnings in git.git, and 157 in linux.git. Anyway, your commit above makes that 1 in both cases, which is certainly a *lot* better. But I think similar to the grep-a-file case it's still way to much, now just because I've got some old badly encoded commits in my history I'll see one warning every time a log revision walk/grep comes across those. On the "not categorically opposed" I think that this sort of warning /might/ be good if: * It weren't enabled by default, or at least as a transition had something like a advise() message pointing at a fsck.skipList-like (or other instructions, replace?) about how to quiet it. * We're realy dumb with how we chain data->iconv->PCRE here. I.e. we'll whine that we can't reencode just to match my "foobarbaz", but we could just keep walking past bad bytes. We should ideally say "we might have matched your data, but *because* of the encoding failure we couldn't. We can easily know with something like "foobarbaz" that that's not the case. Anyway, I think all of that we can leave for the future, because I'd simply assumed that this was based on some report that had to do with someone not matching with --grep or whatever because of the details of the encoding going wrong, e.g. a string that's later in a commit message, but a misencoded character tripped it up. But in this case this seems to have been because someone tried to feed "HTML" to it, which is not an encoding, and something iconv_open() has (I daresay) always and will always error on. It returns -1 and sets errno=EINVAL. So having a warning or other detection in the revision loop seems backwards to me, surely we want something like the below instead? I.e. die as close to bad option parsing as possible? Note that this will now die if we have NO_ICONV=Y, even with your patch, that seems like a feature. Now we'll silently ignore it. I.e. we'll warn because we failed to re-encode, but we're using a stub function whose body is: { if (e) *e = 0; return NULL; } So ditto the garbage encoding name we should have died a lot earlier. Aside from your warning test the below makes tests in t4201-shortlog.sh fail, but those just seem broken to me. I.e. they seem to rely on git staying quiet if i18n.commitencoding is set to garbage. diff --git a/environment.c b/environment.c index 43bb1b35ffe..c26b18f8e5c 100644 --- a/environment.c +++ b/environment.c @@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath) const char *get_log_output_encoding(void) { - return git_log_output_encoding ? git_log_output_encoding + const char *encoding = git_log_output_encoding ? git_log_output_encoding : get_commit_output_encoding(); +#ifndef NO_ICONV + iconv_t conv; + conv = iconv_open(encoding, "UTF-8"); + if (conv == (iconv_t) -1 && errno == EINVAL) + die_errno("the '%s' encoding is not known to iconv", encoding); +#else + if (strcmp(encoding, "UTF-8")) + die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding); +#endif + return encoding; } const char *get_commit_output_encoding(void)