Re: [PATCH v4 3/4] git mailinfo: added a --recode-patch parameter

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 29 Nov 2010 12:23:41 -0800

"ZHANG, Le" <r0bertz@xxxxxxxxxx> writes:

> When this parameter is specified, patch will be converted to a target
> encoding before applied.  The target encoding defaults to UTF-8. It
> could also be specified by i18n.patchencoding.
>
> Signed-off-by: ZHANG, Le <r0bertz@xxxxxxxxxx>
> ---

Ah, please forget (most of) what I said in my review of 2/4.  This series
is not about incoming e-mails that misidentify their content encoding, but
is about accepting patches from e-mails that are in different encoding and
correctly says which encoding they are in.  We make mailinfo convert the
payload to UTF-8 or "i18n.patchencoding".

The above is a clear indication that "i18n.patchencoding" is grossly
misnamed.  It sounds as if it is about the encoding of patches, and
because we are discussing "mailinfo", I mistook that is about the encoding
used in the incoming patches.

But it is not.

It is about encoding of blobs and paths used in the repository.  With your
series, the mailinfo codepath that deals with incoming patches happens to
become the first user to pay attention to the repository data encoding,
but that is not a good reason to name it "i18n.patchencoding".

In the longer run, it is entirely plausible that we will want to know
about different encodings used for these things:

 - encoding in which blobs and tree paths are stored in the repository.
   This is what you called "i18n.patchencoding".

 - encoding in which the log messages of the commits are stored in the
   repository.  We have "i18n.commitencoding" variable for that.

 - encoding in which incoming patches are given to "mailinfo".  This is
   read from the e-mail message.

 - encoding in which blobs are checked out to the working tree.  In a
   distant future, we may want to re-encode blobs from the repository
   encoding to this encoding while checking out, and the other way while
   checking in.

 - encoding in which readdir(3) returns paths from the working tree.  In a
   distant future, we may want to re-encode paths from the repository
   encoding to this encoding while checking out, and the other way while
   checking in.

I think as a short-term solution to your immediate issue, your series may
be good enough, but if we are to go this route, we really should make it
"the repository encoding".  And the variable that replaces your
"patch_charset" should be declared in cache.h, defined in environment.c,
parsed by git_default_config(), and be accessible to everybody.

An alternative is to use i18n.commitencoding for that purpose, but that
will cause issues with existing repositories---they have been happy
without us recoding the payload, and they will get upset if we suddenly
start doing so.  So I'd say that a new "i18n.repositoryencoding"
configuration variable and repository_encoding variable would be a better
design in the longer term.  They

 (1) default to "binary" (or "bytestring", "literal", "verbatim",
     whatever) to tell us _not_ to do any encoding conversion to the
     contents when accepting patches, generating patches, checking out,
     checking in, or running git-archive; and

 (2) when set, certain operations will pay attention to it; in your
     situation, "mailinfo" will convert incoming e-mails to its value
     (presumably "UTF-8").

Yet even longer term, we probably would need to make the blob encoding
into an attribute to the path (e.g. think about an i18n documentation
project, where different translations may need to be stored in different
encodings).

This will open another big can of worms, though.  Unless the incoming
e-mail is split into a MIME multipart that contains one-patch per file
with each part in different encoding, a single plaintext message needs to
be in a single encoding, so your mailinfo patch would most likely need to
encode to one canonical encoding (i.e. "UTF-8"), and then reencode to per
path encoding when feeding the patch.  Then there is another issue of what
to do with the tree paths that appear in "diff --git" and "rename from ..."
headers which most likely to be a single canonical encoding in the project.

But we need to start from somewhere, so let's make the first step "a
single project wide blob and tree path encoding".

Opinions?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html