[PATCH v2 1/3] git-p4: remove support for Python 2

Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx> · Sun, 12 Dec 2021 19:30:10 -0500

> On Sun, Dec 12, 2021, 5:39 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
> This summary makes sense, i.e. if the original SCM doesn't have a
> declared or consistent encoding then having no "encoding" header etc. in
> git likewise makes sense, and we should be trying to handle it in our
> output layer.
>
> [Snipped from above]:
>
> > It's not clear to me how "attempt to detect the encoding somehow" would
> > work.  The first option therefore seems like the best choice.
>
> This really isn't possible to do in the general case, but you can get
> pretty far with heuristics.
>
> I already submitted a patch several months ago to introduce a "p4.fallbackEncoding" option. It got merged to at least the lowest branch, but I think it died at that point.

I did considerable research into the possible options at the time, and
I'm pretty sure the best approach would be:

Add an optional setting for the user to set the encoding.

When decoding, first try UTF-8. If that succeeds, then it's almost
certain that the encoding really is UTF-8. The nature of UTF-8 is that
non-UTF-8 text almost never just happens to be valid when decoded as
UTF-8.

If that fails, use the new setting if present.

This is what my patch does.

I think it would be better to go beyond that, and if it fails UTF-8,
and the new setting was not specified, then use some well- accepted
heuristic library to detect the encoding.

Frankly anything would be better than the current behavior, which is
to completely crash on the first non UTF-8 character encountered (at
least with Python 3).