Re: [PATCH] Unicode: update of combining code points

Torsten Bögershausen <tboegi@xxxxxx> · Wed, 09 Apr 2014 18:48:29 +0200

On 04/09/2014 12:37 AM, Junio C Hamano wrote:
> Jonathan Nieder <jrnieder@xxxxxxxxx> writes:
>
>> Torsten Bögershausen wrote:
>>
>>> Unicode 6.3 defines the following code as combining or accents,
>>> git_wcwidth() should return 0.
>>>
>>> Earlier unicode standards had defined these code point as "reserved":
>> Thanks for the update.  Could the commit message also explain how this
>> was noticed and what the user-visible effect is?
>>
>> For example:
>>
>>  "Unicode just announced that <...>.  That means we should mark the
>>   relevant code points as combining characters so git knows they are
>>   zero-width and doesn't screw up the alignment when presenting branch
>>   names in columns with 'git branch --column'"
>>
>> or something like that.
> Perhaps (the original read clearly enough for me, though).
>
>> [...]
>>> 358 COMBINING DOT ABOVE RIGHT
>>> 359 COMBINING ASTERISK BELOW
>> I'm not sure this list is needed --- the code + the reference to the
>> Unicode 6.3 standard seems like enough (but if you think otherwise,
>> I don't really mind).
> I can go either way.
>
>>> This commit touches only the range 300-6FF, there may be more to be updated.
>> The "there may be more" here sounds ominous.
> Indeed it does ;-)
>
>> Does that mean Unicode
>> 6.3 also added some zero-width characters in other ranges that should
>> be dealt with in the future?  How many such ranges?  How do we know
>> when we're done?
>>
>> Just biting off the most important characters first and putting off
>> the rest for later sounds fine to me --- my complaint is that the
>> above comment doesn't make clear what the to-do list is for finishing
>> the update later.
> I'll queue this at the tip of 'pu', not to forget about it while
> waiting for a clarification.
>
> Thanks.
Thanks for the comments, here comes the long version of the story:
I recently fooled myself by running
"git config --global user.name" with a decomposed "ö" on a new Mac OS X machine.

While there were few problems on Mac OS, all Windows and Linux machines stumbled
over the decomposed ö, or more exactly over 0x308, COMBINING DIAERESIS (the 2 dots),
giving all kinds of weird output in "git log".
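
To illustrate the difference between the two forms, here is a small sketch in Python (not git code; the helper name describe() is made up for this example). It relies only on the standard unicodedata module and shows that the decomposed ö is two code points, the second of which is the combining character mentioned above:

```python
import unicodedata

# Precomposed form: a single code point.
composed = "\u00f6"
# Decomposed form: "o" followed by U+0308.
decomposed = "o\u0308"

def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in s]

if __name__ == "__main__":
    for label, text in (("composed", composed), ("decomposed", decomposed)):
        print(label, describe(text), text.encode("utf-8"))
    # NFC normalization turns the decomposed form into the precomposed one.
    print(unicodedata.normalize("NFC", decomposed) == composed)
```

A terminal or width calculation that does not treat U+0308 as zero-width will count the decomposed name as one column wider than the precomposed one.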

Looking into commit.c and utf8.c to see how to improve the situation, I made these observations:
- Some code from commit.c can possibly be moved into utf8.c, so that we only
  have one UTF-8 parser.
- A solution would be to run precompose_string() under Mac OS (which is a no-op otherwise).
  This could have saved my day. Probably I will make a patch some day.
- Some of the combining code points exist in Unicode 6.3, but not in utf8.c
  (which seems to be based on a Unicode version >2.0 and <6.3).
  I found some in the 0x300 area, looked at the neighbors, and had enough time to
  read all code pages up to 0x7FF.

So if somebody knows how to find out which code points are combining characters or accents,
or in other words should return 0 in git_wcwidth(), please let me know.
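
One possible way to enumerate such code points, sketched in Python rather than taken from git: the Unicode general category of each code point is available through the standard unicodedata module. The sketch below assumes that "should return 0" corresponds roughly to the categories Mn (nonspacing mark) and Me (enclosing mark); whether other categories belong in the set is a separate decision. The function names are made up for this example, and the result depends on the Unicode version shipped with the Python interpreter (unicodedata.unidata_version), which will usually not be 6.3.

```python
import unicodedata

# Assumption: these general categories are the ones to treat as zero-width.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me"}

def is_zero_width(cp):
    """True if the code point's general category is in the assumed set."""
    return unicodedata.category(chr(cp)) in ZERO_WIDTH_CATEGORIES

def zero_width_ranges(lo, hi):
    """Return inclusive (first, last) ranges of zero-width code points in lo..hi."""
    ranges = []
    start = None
    for cp in range(lo, hi + 1):
        if is_zero_width(cp):
            if start is None:
                start = cp              # open a new range
        elif start is not None:
            ranges.append((start, cp - 1))  # close the current range
            start = None
    if start is not None:
        ranges.append((start, hi))      # range runs to the end of the interval
    return ranges

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for first, last in zero_width_ranges(0x0000, 0x07FF):
        print(f"{first:04X}--{last:04X}" if first != last else f"{first:04X}")
```

Comparing the printed ranges against the table in utf8.c would show which entries are missing, and raising the upper bound extends the check beyond 0x7FF.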

How about this as a commit message:

Unicode: partially update to version 6.3

Unicode 6.3 defines the following code points as combining characters
or accents, for which git_wcwidth() should return 0.

Earlier Unicode standards had defined these code points as "reserved":
358--35C
487
5A2, 5BA, 5C5, 5C7
604, 616--61A, 659--65F

Note: for this commit only the range 0..7FF has been checked;
more updates may be needed.

Signed-off-by: Torsten Bögershausen <tboegi@xxxxxx>
