Re: Really noisy encoding warnings post-v2.33.0

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 09 Oct 2021 15:47:16 +0200

On Fri, Oct 08 2021, Jeff King wrote:

> On Fri, Oct 08, 2021 at 10:36:02PM -0400, Jeff King wrote:
>
>> If that were coupled with, say, an advise() call to explain that output
>> and matching might be inaccurate (and show that _once_), that might
>> might it more clear what's going on.
>> 
>> Now I am sympathetic to flooding the user with too many messages, and
>> maybe reducing this to a single instance of "some commit messages could
>> not be re-encoded; output and matching might be inaccurate" is the right
>> thing. But in a sense, it's also working as designed: what you asked for
>> is producing wrong output over and over, and Git is saying so.
>
> The single-output version would perhaps be something like this:
>
> diff --git a/pretty.c b/pretty.c
> index 708b618cfe..c86f41bae7 100644
> --- a/pretty.c
> +++ b/pretty.c
> @@ -606,6 +606,21 @@ static char *replace_encoding_header(char *buf, const char *encoding)
>  	return strbuf_detach(&tmp, NULL);
>  }
>  
> +static void show_encoding_warning(const char *output_encoding)
> +{
> +	static int seen_warning;
> +
> +	if (seen_warning)
> +		return;
> +
> +	seen_warning = 1;
> +	warning("one or more commits could not be re-encoded to '%s'",
> +		output_encoding);
> +	advise("When re-encoding fails, some output may be in an unexpected\n"
> +	       "encoding, and pattern matches against commit data may be\n"
> +	       "inaccurate.");
> +}
> +
>  const char *repo_logmsg_reencode(struct repository *r,
>  				 const struct commit *commit,
>  				 char **commit_encoding,
> @@ -673,7 +688,7 @@ const char *repo_logmsg_reencode(struct repository *r,
>  	 * case we just return the commit message verbatim.
>  	 */
>  	if (!out) {
> -		warning("unable to reencode commit to '%s'", output_encoding);
> +		show_encoding_warning(output_encoding);
>  		return msg;
>  	}
>  	return out;

I'm not categorically opposed to having this warning stay, because I can
imagine an implementation of it that isn't so overmatching, but I think
neither one of us is going to work on that, so ....

We have the exact same edge case in the grep-a-file code, and it's a
delibirate decision to stay quiet about it.

Let's assume your pattern contains non-ASCII, you've asked for
locale-aware grepping, you want \d+ to mean all sorts of Unicode
fuzzyness instead of just [0-9] etc. (not yet implemented, PCRE_UCP).

It would still be annoying to see a warning every time you grep without
providing a pathspec that blacklists say the 100 '*.png' files that are
in your tree.

And that's a case where we *could* say that the user should mark them
with .gitattributes or whatever, but making every git user go through
that would be annoying to them, so we just do our best and silently fall
back.

Similarly with this, let's say I'm on an OS that likes UTF-16 better, as
some of our users do, I have the relevant settings to re-encode git.git
or linux.git. Now run:

    git -c i18n.logOutputEncoding=utf16 log --author=foobarbaz

And it's 2 warnings in git.git, and 157 in linux.git. Anyway, your
commit above makes that 1 in both cases, which is certainly a *lot*
better.

But I think similar to the grep-a-file case it's still way to much, now
just because I've got some old badly encoded commits in my history I'll
see one warning every time a log revision walk/grep comes across those.

On the "not categorically opposed" I think that this sort of warning
/might/ be good if:

 * It weren't enabled by default, or at least as a transition had
   something like a advise() message pointing at a fsck.skipList-like
   (or other instructions, replace?) about how to quiet it.

 * We're realy dumb with how we chain data->iconv->PCRE here. I.e. we'll
   whine that we can't reencode just to match my "foobarbaz", but we
   could just keep walking past bad bytes. We should ideally say "we
   might have matched your data, but *because* of the encoding failure
   we couldn't.

   We can easily know with something like "foobarbaz" that that's not
   the case.

Anyway, I think all of that we can leave for the future, because I'd
simply assumed that this was based on some report that had to do with
someone not matching with --grep or whatever because of the details of
the encoding going wrong, e.g. a string that's later in a commit
message, but a misencoded character tripped it up.

But in this case this seems to have been because someone tried to feed
"HTML" to it, which is not an encoding, and something iconv_open() has
(I daresay) always and will always error on. It returns -1 and sets
errno=EINVAL.

So having a warning or other detection in the revision loop seems
backwards to me, surely we want something like the below instead?
I.e. die as close to bad option parsing as possible?

Note that this will now die if we have NO_ICONV=Y, even with your patch,
that seems like a feature. Now we'll silently ignore it. I.e. we'll warn
because we failed to re-encode, but we're using a stub function whose
body is:

    { if (e) *e = 0; return NULL; }

So ditto the garbage encoding name we should have died a lot earlier.

Aside from your warning test the below makes tests in t4201-shortlog.sh
fail, but those just seem broken to me. I.e. they seem to rely on git
staying quiet if i18n.commitencoding is set to garbage.

diff --git a/environment.c b/environment.c
index 43bb1b35ffe..c26b18f8e5c 100644
--- a/environment.c
+++ b/environment.c
@@ -357,8 +357,18 @@ void set_git_dir(const char *path, int make_realpath)
 
 const char *get_log_output_encoding(void)
 {
-	return git_log_output_encoding ? git_log_output_encoding
+	const char *encoding = git_log_output_encoding ? git_log_output_encoding
 		: get_commit_output_encoding();
+#ifndef NO_ICONV
+	iconv_t conv;
+	conv = iconv_open(encoding, "UTF-8");
+	if (conv == (iconv_t) -1 && errno == EINVAL)
+		die_errno("the '%s' encoding is not known to iconv", encoding);
+#else
+	if (strcmp(encoding, "UTF-8"))
+		die("compiled with NO_ICONV=Y, can't re-encode to '%s'", encoding);
+#endif
+	return encoding;
 }
 
 const char *get_commit_output_encoding(void)








Re: *Really* noisy encoding warnings post-v2.33.0

Re: Really noisy encoding warnings post-v2.33.0