Re: [PATCH 00/22] Refactor to accept NUL in commit messages

Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> · Sun, 23 Oct 2011 21:17:41 +1100

On Sun, Oct 23, 2011 at 8:46 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Nguyen Thai Ngoc Duy <pclouds@xxxxxxxxx> writes:
>
>> On Sun, Oct 23, 2011 at 4:51 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>> ...
>>> The low level object format of our commit is textual header fields, each
>>> of which is terminated with a LF, followed by a LF to mark the end of
>>> header fields, and then opaque payload that can contain any bytes. It does
>>> not forbid a non-Git application to reuse the object store infrastructure
>>> to store ASN.1 binary goo there, and the low level interface we give such
>>> as cat-file is a perfectly valid way to inspect such a "commit" object.
>>
>> cat-file is fine, commit-tree (or any commands that call
>> commit_tree()) cuts at NUL though.
>> I wonder how git processes commit messages in utf-16.
>
> That is exactly what I am saying.
>
> Perhaps you didn't either read or understand what you omitted from your
> quoting; otherwise you even wouldn't have brought up utf-16.
>
> Let me requote that part for you.
>
>> But when it comes to "Git" Porcelains (e.g. the log family of commands),
>> we do assume people do not store random binary byte sequences in commits,
>> and we do take advantage of that assumption by splitting each "line" at
>> LF, indenting them with 4 spaces, etc. In other words, a commit log in the
>> Git context _is_ pretty much text and not arbitrary byte sequence.
>
> Think what would cutting at a byte whose value is 012 and adding four
> bytes whose values are 040 to each of "lines" that formed with such
> cutting do to UTF-16 goo, even if it does not contain any NUL byte. As far
> as Git Porcelains are concerned, it is no different from random binary
> byte sequences.
>

I'm sorry. The utf-16 remark was an afterthought, added when I had
nearly finished the reply and had already cut that quote.

The assumption that people do not store random binary byte sequences
in commits sort of conflicts with the "encoding" field in the commit
header, though. The assumption is documented in i18n.txt; I guess it's
just me who did not read the documentation carefully. But maybe it
would be good to stop people from shooting themselves in the foot in
this case (i.e. setting the encoding to utf-16 or similar).
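
To make the utf-16 concern concrete, here is a rough Python sketch.
It is not git code: cut_at_nul() and indent_lines() are hypothetical
helpers that only stand in for a consumer that stops at the first NUL
and for the "split at LF, indent with 4 spaces" step described above.

```python
# Illustration only; these helpers are assumptions, not git functions.

def cut_at_nul(buf: bytes) -> bytes:
    """Mimic a C-string style consumer that stops at the first NUL byte."""
    return buf.split(b"\x00", 1)[0]


def indent_lines(buf: bytes) -> bytes:
    """Split at byte 012 (LF) and prefix each piece with four 040 bytes."""
    return b"\n".join(b"    " + line for line in buf.split(b"\n"))


# Case 1: an ASCII message stored as UTF-16 has a NUL in every code unit.
message = "fix bug"
utf8 = message.encode("utf-8")
utf16 = message.encode("utf-16-le")
print(cut_at_nul(utf8))   # the whole message survives
print(cut_at_nul(utf16))  # only the first byte survives

# Case 2: UTF-16 text with no NUL byte at all can still contain byte 012.
# U+010A encodes as the bytes 0A 01 in UTF-16LE.
text = "\u4e2d\u010a\u6587"
goo = text.encode("utf-16-le")
print(b"\x00" in goo)     # no NUL anywhere
indented = indent_lines(goo)
# The split lands in the middle of a character, so the result no longer
# decodes to the original text.
print(indented.decode("utf-16-le") == text)
```

The second case is the one that matters for the log family: even
without any NUL, cutting at 012 and adding 040 bytes changes what the
payload decodes to.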
-- 
Duy