Re: [PATCH v2] [RFC] git-p4: improve encoding handling to support inconsistent encodings

Andrew Oakley <andrew@xxxxxxxxxxxxx> · Sun, 17 Apr 2022 19:11:44 +0100

On 14/04/2022 10:57, Tao Klerks wrote:
On Wed, Apr 13, 2022 at 10:41 PM Andrew Oakley <andrew@xxxxxxxxxxxxx> wrote:

On Wed, 13 Apr 2022 06:24:29 +0000
"Tao Klerks via GitGitGadget" <gitgitgadget@xxxxxxxxx> wrote:
Make the changelist-description- and user-fullname-handling code
python-runtime-agnostic, introducing three "strategies" selectable via
config:
- 'legacy', behaving as previously under python2,
- 'strict', behaving as previously under python3, and
- 'fallback', favoring utf-8 but supporting a secondary encoding when
utf-8 decoding fails, and finally replacing remaining unmappable
bytes.

I was thinking about making the config option be a list of encodings to
try.  So the options you've given map something like this:
- "legacy" -> "raw"
- "strict" -> "utf8"
- "fallback" -> "utf8 cp1252" (or whatever is configured)

This doesn't handle the case of using a replacement character, but in
reality I suspect that fallback encoding will always be a legacy 8bit
codec anyway.

I think what you've proposed is fine too, I'm not sure what would end
up being easier to understand.

I'm not sure I understand the proposal... Specifically, I don't
understand how "raw" would work in that scheme.

In "my" current scheme, there is a big difference between "legacy" and
the other two strategies: the legacy strategy reads "raw", but also
*writes* "raw".

The other schemes read whatever encoding, and then write utf-8. In the
case of strict, that actually works out the same as "raw", as long as
the input bytes were valid utf-8 (and otherwise nothing happens
anyway). In the case of fallback, that's a completely different
behavior to legacy's read-raw write-raw.

The way I look at it is that you both read and write bytes, and you may 
attempt to decode and re-encode text on the way.  Both the decoding and 
the encoding are done in metadata_stream_to_writable_bytes, so nothing 
else needs to know about the raw option being different.

Perhaps it's easier to explain with a bit of (untested) code.  I was 
thinking of something like this:

def metadata_stream_to_writable_bytes(s):
    if not isinstance(s, bytes):
        return s.encode('utf-8')

    for encoding in gitConfigList('git-p4.metadataEncoding') or 
['utf-8', 'cp1252']:
        if encoding == 'passthrough':
            return s  # do not try to correct text encoding
        else:
            try:
                return s.decode(encoding).encode('utf-8')
            except UnicodeDecodeError:
                pass  # try the next configured encoding

    raise MetadataDecodingException(s)
Is your proposal to independently specify the read encodings *and* the
write encoding, as separate parameters? That was actually my original
approach, but it turned out to, in my opinion, be harder to understand
(and implement :) )

I don't think it's important to be able specify the encoding to be used 
in git.  I've not spotted anyone asking for that feature.  I think it 
could be added later if someone needs it.

      * Does it make sense to make "fallback" the default decoding
strategy in python3? This is definitely a change in behavior, but I
believe for the better; failing with "we defaulted to strict, but you
can run again with this other option if you want it to work" seems
unkind, only making sense if we thought fallback to cp1252 would be
wrong in a substantial proportion of cases...

The only issue I can see with changing the default is that it might
lead to a surprising loss of data for someone migrating to git.  Maybe
print a warning the first time git-p4 encounters something that can't be
decoded as UTF-8, but then continue with the fallback to cp1252?

Honestly, I'm not sure how much a warning does. In my perforce repo,
for example, any new warnings during the import would get drowned out
by the mac metadata ignoring warnings.

FWIW in the perforce repository I work with this doesn't happen much and 
I would notice the additional warning about text encodings.  I suspect 
this will be another thing which varies a lot between repositories.

I understand and share the data loss concern.

As I just answered Ævar, I *think* I'd like to address the data loss
concern by escaping all x80+ bytes if something cannot be interpreted
even using the fallback encoding. In a commit message there could also
be a suffix explaining what happened, although I suspect that's
pointless complexity. The advantage of this approach is that it makes
it *possible* to reconstruct the original bytestream precisely, but
without creating badly-encoded git commit messages that need to be
skirted around.

I think this gets pretty messy though.  In my opinion it's not any nicer 
than putting the raw bytes in the commit message.

Git does not make any attempt enforce the commit metadata encoding, so I 
think that tools really should make an attempt to handle invalid data in 
a somewhat sensible fashion.

I don't think there is really a "right" answer, anything reasonable 
would be better than what we've got now.