Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings

On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>
>
> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>
> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
> > metadata (and does not set an "encoding" header on the commits); this
> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
> > commit metadata in git.
>
> If it doesn't have an "encoding" header isn't any sequence of bytes OK
> with git, so how does it create invalid metadata in git?

Just because git allows you to shove any sequence of bytes into a
commit header doesn't mean the resulting text is "valid" metadata
text for all or most purposes. The correct way to encode text in git
commit metadata is to use utf-8 (OR to tell any readers of this data
that it's something other than utf-8 via the encoding header) - it's
just that git itself, the official client, is tolerant of bad byte
sequences.
Other clients are less tolerant. "Sublime Merge", for example, will
fail to display the commit text at all in some contexts if the bytes
are not valid utf-8 (or noted as being something else).
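
To make that notion of "valid" concrete, here is a minimal sketch of the
check a stricter client might apply. The function name and its parameters
are hypothetical, and this is not how git or Sublime Merge actually
implement it; it only assumes that the header value names a codec Python
knows about.

```python
from typing import Optional


def is_validly_encoded(raw: bytes, encoding_header: Optional[str] = None) -> bool:
    """Return True if the commit text decodes under its declared encoding.

    Hypothetical helper: with no "encoding" header, the text is expected
    to be utf-8; otherwise it is expected to decode as the named codec.
    """
    codec = encoding_header or "utf-8"
    try:
        raw.decode(codec)
    except (UnicodeDecodeError, LookupError):
        # Either the bytes don't fit the codec, or the codec name is unknown.
        return False
    return True
```

Under this sketch, cp1252 bytes with no header are rejected, while the
same bytes accompanied by an "encoding cp1252" header are accepted.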

>
> Do you mean that something on the Python side gets confused and doesn't
> correctly encode it in that case, or that it's e.g. valid UTF-8 but
> we're lacking the metadata?

In git-p4 under python2, the bytes are simply copied from the Perforce
commit metadata into the git commit metadata verbatim; if those bytes
happen to be valid utf-8, then they will be interpreted as such in git
and everything is great. If that is *not* the case, e.g. the bytes are
actually Windows cp1252 (with bytes/characters in the x8a+ range),
then "git log" for example will output the raw bytes, and anything
looking at those bytes (a terminal, or a process that called git) will
get those unexpected bytes, and will need to deal with them accordingly.
A terminal will probably display "unprintable character" glyphs, python3
will blow up by default, python2 will be perfectly happy by default, etc.
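
As a rough illustration of the python3 behaviour described above (a
sketch only; the sample text and the decode_commit_text helper are made
up for this example, and the cp1252 fallback is an assumption rather
than what git-p4 does):

```python
# Sample text encoded as cp1252: "Š" becomes the single byte 0x8a,
# which is not a valid start byte in utf-8.
raw = "Štefan fixed the café build".encode("cp1252")

# A strict utf-8 decode, which is what python3 does by default, raises.
try:
    raw.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

# A tolerant reader can substitute U+FFFD for the offending bytes,
# much like a terminal showing "unprintable character" glyphs.
replaced = raw.decode("utf-8", errors="replace")


def decode_commit_text(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical helper: try utf-8 first, then an assumed fallback codec."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode(fallback)


recovered = decode_commit_text(raw)
```

Here the fallback recovers the original text only because the guess
about the source encoding happens to be right; with a wrong guess the
result would decode without error but contain the wrong characters.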

I summarize this "non-utf-8 bytes in a git commit message without a
qualifying 'encoding' header" situation as "invalidly-encoded commit
metadata in git", due to the impact on downstream consumers of git
metadata. Is there a better characterization?

Thanks,
Tao



