On Wed, Apr 13, 2022 at 4:01 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
>
> On Wed, Apr 13 2022, Tao Klerks via GitGitGadget wrote:
>
> > Under python2, git-p4 "naively" writes the Perforce bytestream into git
> > metadata (and does not set an "encoding" header on the commits); this
> > means that any non-utf-8 byte sequences end up creating invalidly-encoded
> > commit metadata in git.
>
> If it doesn't have an "encoding" header isn't any sequence of bytes OK
> with git, so how does it create invalid metadata in git?

Just because git allows you to shove any sequence of bytes into a
commit header doesn't mean the resulting text is "valid" metadata text
for all or most purposes.

The correct way to encode text in git commit metadata is utf-8 (OR tell
any readers of this data that it's something other than utf-8 via the
encoding header) - it's just that git itself, the official client, is
tolerant of bad byte sequences. Other clients are less tolerant.
"Sublime Merge", for example, will fail to display the commit text at
all in some contexts if the bytes are not valid utf-8 (or noted as
being something else).

>
> Do you mean that something on the Python side gets confused and doesn't
> correctly encode it in that case, or that it's e.g. valid UTF-8 but
> we're lacking the metadata?

In git-p4 under python2, the bytes are simply copied from the perforce
commit metadata into the git commit metadata verbatim; if those bytes
happen to be valid utf-8, then they will be interpreted as such in git
and everything is great. If that is *not* the case, e.g. the bytes are
actually windows cp1252 (with bytes/characters in the x8a+ range), then
"git log", for example, will output the raw bytes, and anything looking
at those bytes (a terminal, or a process that called git) will get
those unexpected bytes and need to deal with them accordingly.
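
As a rough illustration of the cp1252 case, here is a minimal Python 3
sketch. The sample bytes and the helper name is_valid_utf8 are made up
for this example; none of this is taken from git-p4 itself.

```python
# Hypothetical sample: "Šimon fixed the café build" encoded as cp1252.
# 0x8a is "Š" and 0xe9 is "é" in cp1252; neither is valid utf-8 here.
raw = b"\x8aimon fixed the caf\xe9 build"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as utf-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A strict utf-8 decode (the Python 3 default) rejects these bytes.
print(is_valid_utf8(raw))

# A reader that knows the real encoding recovers the text.
print(raw.decode("cp1252"))

# A tolerant reader substitutes U+FFFD for each undecodable byte.
print(raw.decode("utf-8", errors="replace"))
```

A consumer that is told the encoding (the middle case) gets the text
back intact; one that assumes utf-8 either fails or shows replacement
characters.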
A terminal will probably display "unprintable character" glyphs,
python3 will blow up by default, python2 will be perfectly happy by
default, etc.

I summarize this "non-utf-8 bytes in a git commit message without a
qualifying 'encoding' header" situation as "invalidly-encoded commit
metadata in git", due to the impact on downstream consumers of git
metadata. Is there a better characterization?

Thanks,
Tao