> On Sun, Dec 12, 2021, 5:39 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
> This summary makes sense, i.e. if the original SCM doesn't have a
> declared or consistent encoding then having no "encoding" header etc. in
> git likewise makes sense, and we should be trying to handle it in our
> output layer.
>
> [Snipped from above]:
>
> > It's not clear to me how "attempt to detect the encoding somehow" would
> > work. The first option therefore seems like the best choice.
>
> This really isn't possible to do in the general case, but you can get
> pretty far with heuristics.
>

I already submitted a patch several months ago to introduce a
"p4.fallbackEncoding" option. It got merged to at least the lowest
branch, but I think it died at that point.

I did considerable research into the possible options at the time, and
I'm pretty sure the best approach would be:

Add an optional setting for the user to set the encoding.

When decoding, first try UTF-8. If that succeeds, then it's almost
certain that the encoding really is UTF-8. The nature of UTF-8 is that
non-UTF-8 text almost never just happens to be valid when decoded as
UTF-8.

If that fails, use the new setting if present.

This is what my patch does. I think it would be better to go beyond
that, and if it fails UTF-8, and the new setting was not specified, then
use some well-accepted heuristic library to detect the encoding.

Frankly anything would be better than the current behavior, which is to
completely crash on the first non-UTF-8 character encountered (at least
with Python 3).
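
To make the decoding order above concrete, here is a minimal sketch in
Python. It is not the code from the patch: the function name
decode_text and its parameters are hypothetical, fallback_encoding
stands in for the value of the "p4.fallbackEncoding" setting, and
detect is an assumed hook where a heuristic detection library could be
plugged in.

```python
def decode_text(raw, fallback_encoding=None, detect=None):
    """Decode bytes, trying UTF-8 first and then the fallbacks.

    raw               -- bytes to decode
    fallback_encoding -- user-configured encoding name, or None
                         (hypothetical stand-in for p4.fallbackEncoding)
    detect            -- optional callable taking bytes and returning an
                         encoding name or None (hypothetical hook for a
                         heuristic detector)
    """
    # Step 1: try UTF-8. Text in another encoding rarely happens to be
    # valid UTF-8, so success here is strong evidence.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as err:
        utf8_error = err

    # Step 2: use the user-configured encoding if one was given.
    # This can still raise if the bytes are invalid in that encoding.
    if fallback_encoding is not None:
        return raw.decode(fallback_encoding)

    # Step 3: with no setting, ask the heuristic detector if available.
    if detect is not None:
        guessed = detect(raw)
        if guessed:
            return raw.decode(guessed)

    # Nothing else to try: report the original UTF-8 failure.
    raise utf8_error
```

For example, decode_text(b"caf\xe9", fallback_encoding="cp1252")
returns "café", while the same call with no fallback and no detector
raises UnicodeDecodeError, which roughly matches the crash described
above. Keeping the detector behind a callable is only one possible
design; it avoids a hard dependency on any particular library.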