Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Apr 13 2022, Tao Klerks wrote:

> On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason
> <avarab@xxxxxxxxx> wrote:
>>
>>
>> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>>
>> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
>> > metadata (and does not set an "encoding" header on the commits); this
>> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
>> > commit metadata in git.
>>
>> If it doesn't have an "encoding" header isn't any sequence of bytes OK
>> with git, so how does it create invalid metadata in git?
>
> Just because git allows you to shove any sequence of bytes into a
> commit header, doesn't mean the resulting text is "valid" metadata
> text for all or most purposes. The correct way to encode text in git
> commit metadata is utf-8 (OR tell any readers of this data that it's
> something other than utf-8 via the encoding header) - it's just that
> git itself, the official client, is tolerant of bad byte sequences.
> Other clients are less tolerant. "Sublime Merge", for example, will
> fail to display the commit text at all in some contexts if the bytes
> are not valid utf-8 (or noted as being something else).
>
>>
>> Do you mean that something on the Python side gets confused and doesn't
>> correctly encode it in that case, or that it's e.g. valid UTF-8 but
>> we're lacking the metadata?
>
> In git-p4 under python2, the bytes are simply copied from the perforce
> commit metadata into the git commit metadata verbatim; if those bytes
> happen to be valid utf-8, then they will be interpreted as such in git
> and everything is great. If that is *not* the case, eg the bytes are
> actually windows cp1252 (with bytes/characters in the x8a+ range),
> then "git log" for example will output the raw bytes, and anything
> looking at those bytes (a terminal, or a process that called git) will
> get those unexpected bytes, and need to deal accordingly. A terminal
> will probably display "unprintable character" glyphs, python3 will
> blow up by default, python 2 will be perfectly happy by default, etc.
>
> I summarize this "non-utf-8 bytes in a git commit message without a
> qualifying 'encoding' header" situation as "invalidly-encoded commit
> metadata in git", due to the impact on downstream consumers of git
> metadata. Is there a better characterization?

I must admit that all this time I'd been missing the "Lack of this
header implies that the commit log message is encoded in UTF-8." part of
the docs added in 5dc7bcc2453 (Documentation: i18n commit log message
notes., 2006-12-30). I.e. I thought that "encoding"-header-less just
meant/implied "whatever".

But then again there wouldn't be much point in the encoding header if
its value is "utf8", so surely we'd want to use the lack of a header to
disambiguate utf8 v.s. non-utf8.

AFAICT we only allow selecting between encodings, not "no idea what this
is, but here's some raw sequence of bytes", except by omitting the
header.

It seems to me that between legacy/strict/fallback there's a 4th setting
missing here. I.e. a "try-encoding". One where if your data is valid
utf8 (or cp1252 if we want to get fancy and combine it with "fallback")
you get an encoding header, but having tried that we'll just write
whatever raw data we found, but *not* munge it.

I haven't worked with p4, but having done some legacy SCM imports it was
nice to be able to map data 1=1 to git in those "odd encoding" cases, or
even cases where there was raw binary in a commit message or whatever...

Anyway, all of this is fine with me as-is, I just had a drive-by
question after some admittedly not very careful reading, sorry.




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux