Re: [PATCH 3/2] attribute "coding": specify blob encoding

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 03 Jan 2008 13:54:58 -0800

しらいしななこ  <nanako3@xxxxxxxxxxxxxx> writes:

> Quoting Junio C Hamano <gitster@xxxxxxxxx>:
> 
>> This teaches "diff hunk header" function about custom character
>> encoding per path via gitattributes(5) so that it can sensibly
>> chomp a line without truncating a character in the middle.
>>
>> Signed-off-by: Junio C Hamano <gitster@xxxxxxxxx>
>> ---
>>
>>  * This is not intended for serious inclusion, but was done more
>>    as a demonstration of the concept, hence [3/2].
>
> Why not?  It looks a useful addition for us Japanese people.

    (offtopic) I was once told that "us Japanese people" is a
    bad thing to say in public because it sets an unfriendly
    tone by creating a psychological divide between "us" and
    "others".  After all I am one of you ;-)

The reason I do not like the patch as-is is that I have
doubts about the way "coding" acts in the patch.

There already is a clean/smudge filter mechanism.  Even if your
work tree has files in euc-jp or Shift_JIS, you could choose to
internally use UTF-8 at git object level.  Then the part that
deals with diff hunk headers (the topic of the patch we are
discussing) would have to work only on UTF-8 data.
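
To illustrate why the UTF-8-only case is easy, here is a minimal
sketch (not code from the patch; utf8_safe_len() is a made-up name).
It assumes the line is valid UTF-8 and simply backs up over
continuation bytes so that the cut never lands inside a character.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not taken from the patch: return the largest
 * length <= max that does not cut a UTF-8 sequence in the middle.
 * Assumes line[0..len) is valid UTF-8.
 */
size_t utf8_safe_len(const char *line, size_t len, size_t max)
{
	size_t cut = len < max ? len : max;

	/* Nothing is dropped, so nothing can be split. */
	if (cut == len)
		return cut;

	/*
	 * line[cut] is the first byte that would be dropped.  If it is
	 * a continuation byte (10xxxxxx), the cut lands inside a
	 * sequence, so back up to the lead byte of that sequence.
	 */
	while (cut > 0 && ((unsigned char)line[cut] & 0xC0) == 0x80)
		cut--;
	return cut;
}
```

No table of lead-byte ranges is needed here, because a UTF-8
continuation byte can be recognized on its own.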

	Side note: when getting the data from a file in the work
	tree, we convert into internal representation before
	running diff (see diff.c::diff_populate_filespec()), but
	we do not convert it back to external representation by
	running the smudge filter on the diff output.  We might
	optionally want to, but if somebody is going to do this,
	the patch accepting side also needs to be modified to
	reverse the conversion.

The solution with clean/smudge is not applicable to everybody.
The project has to agree on a single canonical encoding, and
only when it chooses UTF-8 does the above become a clean and
workable approach.

If the project, on the other hand, chooses to use a non-UTF-8
encoding (e.g. euc-jp) as the canonical representation,
something like my patch may be necessary.
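
Roughly, that approach amounts to picking a per-encoding function by
the value of the attribute.  The sketch below is only an illustration
under assumptions: the signature of safe_len_fn is assumed (the
signature of the patch's sane_truncate_fn is not shown in this
message), all names are hypothetical, and the EUC-JP scan is not the
truncate_euc_jp() from the patch.

```c
#include <stddef.h>
#include <string.h>

/* Assumed shape of a per-encoding helper; hypothetical. */
typedef size_t (*safe_len_fn)(const char *line, size_t len, size_t max);

/*
 * EUC-JP: SS3 (0x8F) starts a 3-byte character, SS2 (0x8E) and
 * 0xA1..0xFE start a 2-byte character, anything else is one byte.
 * Returns the largest length <= max that ends between characters.
 */
size_t euc_jp_safe_len(const char *line, size_t len, size_t max)
{
	size_t pos = 0;

	if (max > len)
		max = len;
	while (pos < max) {
		unsigned char c = (unsigned char)line[pos];
		size_t n = (c == 0x8F) ? 3 :
			   (c == 0x8E || (0xA1 <= c && c <= 0xFE)) ? 2 : 1;

		if (pos + n > max)
			break;	/* the next character would be cut in half */
		pos += n;
	}
	return pos;
}

/* Fallback for codings without a helper: cut at the byte limit. */
size_t bytewise_safe_len(const char *line, size_t len, size_t max)
{
	(void)line;
	return len < max ? len : max;
}

static const struct {
	const char *coding;
	safe_len_fn fn;
} safe_len_table[] = {
	{ "euc-jp", euc_jp_safe_len },
};

/* Pick the helper for the value of the "coding" attribute. */
safe_len_fn find_safe_len_fn(const char *coding)
{
	size_t i;

	for (i = 0; i < sizeof(safe_len_table) / sizeof(safe_len_table[0]); i++)
		if (coding && !strcmp(coding, safe_len_table[i].coding))
			return safe_len_table[i].fn;
	return bytewise_safe_len;
}
```

The scan starts from the beginning of the line because a byte with
the high bit set can be either the first or a later byte of a
character, so looking at the cut point alone is not enough.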

Between these two ways to skin the cat, I do not want to close
the door on either one of them too early, although I am
somewhat partial to the "internally everything is UTF-8" approach.

Maybe we would want to use "coding" (a short, sweet and nice name
for an attribute) to mean a canned smudge/clean filter pair that
runs iconv to and from UTF-8, making the "internally, everything
is UTF-8" approach a more officially supported option.  If we
choose to go that route, the way the "coding" attribute was used
in the patch directly conflicts with that design, as the world
view my "coding" patch takes is "whatever coding the project
chooses is used internally, and the attribute is used to teach
coding-specific actions to the underlying logic".

>> +static struct {
>> +	const char *coding;
>> +	sane_truncate_fn fn;
>> +} builtin_truncate_fn[] = {
>> +	{ "euc-jp", truncate_euc_jp },
>> +	{ "utf-8", NULL },
>> +};
>
> Can you also do JIS and Shift JIS?  I ask because many of my
> old notes are in Shift JIS and I think it is the same for many
> other people.

I guess somebody else could (hint, hint,...).  Shift_JIS should
be more or less straightforward to add.
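
For whoever picks this up, a minimal sketch of what a Shift_JIS
helper might look like follows.  It is not part of the patch; the
function names and the (line, len, max) signature are assumptions
made for illustration only.

```c
#include <stddef.h>

/*
 * Hypothetical: byte length of the Shift_JIS character that starts
 * with lead byte c.  0x81..0x9F and 0xE0..0xFC start a double-byte
 * character; ASCII / JIS X 0201 Roman and half-width katakana
 * (0xA1..0xDF) are single bytes.
 */
size_t sjis_char_len(unsigned char c)
{
	if ((0x81 <= c && c <= 0x9F) || (0xE0 <= c && c <= 0xFC))
		return 2;
	return 1;
}

/*
 * Return the largest length <= max that ends between characters.
 * The scan has to start at the beginning of the line: a trail byte
 * may fall in 0x40..0x7E and look like ASCII, so the byte at the
 * cut point cannot be classified in isolation.
 */
size_t sjis_safe_len(const char *line, size_t len, size_t max)
{
	size_t pos = 0;

	if (max > len)
		max = len;
	while (pos < max) {
		size_t n = sjis_char_len((unsigned char)line[pos]);

		if (pos + n > max)
			break;	/* the next character would be cut in half */
		pos += n;
	}
	return pos;
}
```

The case to watch is a character such as 0x83 0x5C, whose second
byte is the same as an ASCII backslash.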

With the current code structure, however, ISO-2022 (you said
"JIS" -- Japanese often use that word to mean 7-bit ISO-2022 and
so did you in this context) is a bit cumbersome to handle, as
you cannot just truncate; you also have to add a few escape
bytes to go back to ASCII at the end of the line.
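
To make the difference concrete, here is a rough sketch of what the
ISO-2022-JP case involves.  Everything in it is an assumption made
for illustration: the function name, the separate output buffer, and
the restriction to the four basic designations (ESC ( B, ESC ( J,
ESC $ @ and ESC $ B).  Unlike the other encodings, the helper has to
produce output rather than just return a length.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch.  Copy at most max bytes of the ISO-2022-JP
 * line in[0..len) to out, cutting only between characters and
 * appending ESC ( B when the cut leaves the stream outside ASCII.
 * out must have room for max bytes.  Returns the bytes written.
 */
size_t iso2022jp_truncate(const char *in, size_t len, char *out, size_t max)
{
	static const char to_ascii[] = { 0x1B, '(', 'B' };
	size_t pos = 0;		/* input bytes accepted so far */
	size_t width = 1;	/* bytes per character in the current set */
	int in_ascii = 1;

	while (pos < len) {
		size_t n, tail;
		size_t next_width = width;
		int next_ascii = in_ascii;

		if (in[pos] == 0x1B && pos + 3 <= len &&
		    (in[pos + 1] == '(' || in[pos + 1] == '$')) {
			/* 3-byte designation: '$' selects a 2-byte set */
			n = 3;
			next_width = (in[pos + 1] == '$') ? 2 : 1;
			next_ascii = (in[pos + 1] == '(' && in[pos + 2] == 'B');
		} else {
			n = width;
		}

		/* Outside ASCII, keep room for the return escape. */
		tail = next_ascii ? 0 : sizeof(to_ascii);
		if (pos + n > len || pos + n + tail > max)
			break;
		pos += n;
		width = next_width;
		in_ascii = next_ascii;
	}

	memcpy(out, in, pos);
	if (!in_ascii) {
		memcpy(out + pos, to_ascii, sizeof(to_ascii));
		pos += sizeof(to_ascii);
	}
	return pos;
}
```

This sketch can emit a shift into a two-byte set immediately followed
by the shift back to ASCII when the limit falls right after a
designation; that is harmless but wasteful, and a real implementation
would probably want to drop such an empty pair.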