Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings

On Wed, Apr 13, 2022 at 9:04 PM Ævar Arnfjörð Bjarmason
<avarab@xxxxxxxxx> wrote:
>
>
> AFAICT we only allow selecting between encodings, not "no idea what this
> is, but here's some raw sequence of bytes", except by omitting the
> header.
>

I don't understand what you mean here. My reading of the docs is that
omitting the header *does not* imply "no idea what this is" - it
implies "this is utf-8".

Git will do the best it can (convert to utf-8 at output time if the
bytes are parseable in the specified or implied encoding, and return
the raw bytes otherwise), *regardless* of whether an encoding is
explicitly specified or utf-8 is implied.
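
In rough Python terms, here is how I read that output-time behavior
(purely illustrative, not git's actual code; the helper name is mine):

    def output_message(stored, encoding_header=None):
        enc = encoding_header or 'utf-8'  # omitted header implies utf-8
        try:
            # parseable in the specified/implied encoding: emit utf-8
            return stored.decode(enc).encode('utf-8')
        except (UnicodeDecodeError, LookupError):
            # not parseable: the raw bytes come through untouched
            return stored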

> It seems to me that between legacy/strict/fallback there's a 4th setting
> missing here. I.e. a "try-encoding". One where if your data is valid
> utf8 (or cp1252 if we want to get fancy and combine it with "fallback")
> you get an encoding header, but having tried that we'll just write
> whatever raw data we found, but *not* munge it.

For utf-8 specifically, the "legacy" strategy achieves this effect
with less overhead: it copies the bytes over raw, and in git the
implied encoding is utf-8. So if the bytes were utf-8 in the first
place, they're fine (we didn't need to munge them in any way), and if
they weren't utf-8 we copied them over anyway, which is exactly the
"tried and failed" behavior you propose.

For cp1252 or another encoding, the behavior you propose is an
enhancement/modification of the current "fallback" behavior, I guess:
Currently if the fallback encoding doesn't wash, any remaining "bad
bytes" get replaced out of existence. This leads to data loss of those
individual bytes, and it also means that if the data really wasn't
cp1252 in the first place, you might just have garbled up the data
something awful. Of course, you might have done that without any
errors anyway - cp1252 leaves only five bytes unmapped, so most arbitrary
text data will be successfully "interpreted", no matter how
erroneously.

The advantage of the current "replace" behavior is that you *always*
end up with valid utf-8 data in your commit text. The disadvantage is
that you can suffer (minor) unrecoverable data loss.
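
To make that concrete (illustrative Python only, not the actual
git-p4 code):

    # one byte cp1252 maps (0xe9 -> é) and one byte it leaves unmapped (0x81)
    raw = b'caf\xe9 \x81'
    text = raw.decode('cp1252', errors='replace')
    # -> 'café \ufffd': the 0x81 byte is replaced and lost for good
    commit_text = text.encode('utf-8')  # but the result is always valid utf-8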

I think your proposal is that (optionally?) the fallback-or-raw
behavior would simply leave the original bytes in place in this
situation, making no attempt to interpret them or convert them to
utf-8, just bringing them to git as-is. This would make that
particular commit message "invalidly encoded", in the sense that any
git client or git-caller would need to "do what it can" with that
sequence of bytes.

This tradeoff between avoiding data loss and maintaining type/encoding
coherence is not one I'm comfortable deciding on. I would ideally
prefer a third route, where the data *is* interpreted and converted,
but in a fully reversible way.

What would you think of a scheme where, *if* the fallback encoding
fails to decode successfully, we simply take all bytes at or above
0x80 and escape them to a form like "\x8c", so a commit message might
end up like "solve the problem with the Japanese character:
\x8f\xc4."? Would this way of preserving the bytes, without breaking
out of having a known (utf-8) encoding in the git commits, make sense
to you? We could even add a suffix to the message like "[original data
contained bytes that could not be mapped in the target encoding
cp1252; bytes at or above 0x80 were escaped as \xNN, and backslashes
were escaped as \\ to avoid ambiguity]", or something less
horrendously verbose :)

>
> I haven't worked with p4, but having done some legacy SCM imports it was
> nice to be able to map data 1=1 to git in those "odd encoding" cases, or
> even cases where there was raw binary in a commit message or whatever...
>

My problem here, again, is that these "badly encoded commit messages"
endure forever in your commit history: any tool that wants to parse
your history will have to deal with them, skirt around them, etc. That
just seems like bad discipline. If the intent is to ensure that you
*can* reconstruct those bytes if/when you need to, we should find a
way to store them safely, in a way that won't randomly trip others up.

> Anyway, all of this is fine with me as-is, I just had a drive-by
> question after some admittedly not very careful reading, sorry.

Thanks for looking into it!



