On Wed, Apr 13, 2022 at 9:04 PM Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> wrote:
>
> AFAICT we only allow selecting between encodings, not "no idea what this
> is, but here's some raw sequence of bytes", except by omitting the
> header.
>

I don't understand what you mean here. My reading of the docs is that
omitting the header *does not* imply "no idea what this is" - it implies
"this is utf-8". Git will do the best it can (convert to utf-8 at output
time if the bytes are parseable in the specified or implied encoding,
and return the raw bytes otherwise), *regardless* of whether an encoding
is explicitly specified or utf-8 is implied.

> It seems to me that between legacy/strict/fallback there's a 4th setting
> missing here. I.e. a "try-encoding". One where if your data is valid
> utf8 (or cp1252 if we want to get fancy and combine it with "fallback")
> you get an encoding header, but having tried that we'll just write
> whatever raw data we found, but *not* munge it.

For utf-8 specifically, the "legacy" strategy achieves this effect with
less overhead: it copies the bytes over raw, and in git the implied
encoding is utf-8. So if the bytes were utf-8 in the first place, then
they're good (we didn't need to munge them in any way), and if they
weren't utf-8 we copied them over anyway, which is the "tried and
failed" behavior you propose also.

For cp1252 or another encoding, the behavior you propose is an
enhancement/modification of the current "fallback" behavior, I guess:
currently, if the fallback encoding doesn't wash, any remaining "bad
bytes" get replaced out of existence. This leads to data loss of those
individual bytes, and it also means that if the data really wasn't
cp1252 in the first place, you might just have garbled up the data
something awful. Of course, you might have done that without any errors
anyway - cp1252 only leaves 5 unmapped bytes, so most arbitrary text
data will be successfully "interpreted", no matter how erroneously.
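
As a rough illustration of the three strategies discussed above, here is
a minimal Python sketch. The function name convert_metadata and its
parameters are made up for this example and are not git-p4's actual
code; in particular, the "strict" branch (refuse anything that is not
valid utf-8) is an assumption about what that setting means.

```python
def convert_metadata(raw: bytes, strategy: str = "fallback",
                     fallback_encoding: str = "cp1252") -> bytes:
    """Hypothetical helper: return the bytes to hand to git."""
    if strategy == "legacy":
        # Copy the bytes through untouched; git will assume utf-8.
        return raw
    try:
        raw.decode("utf-8")
        return raw  # already valid utf-8, nothing to do
    except UnicodeDecodeError:
        if strategy == "strict":
            raise  # assumed meaning of "strict": fail loudly
    # "fallback": reinterpret in the fallback encoding. Bytes that the
    # encoding does not map are replaced with U+FFFD, which is lossy.
    text = raw.decode(fallback_encoding, errors="replace")
    return text.encode("utf-8")


if __name__ == "__main__":
    print(convert_metadata(b"caf\xe9"))  # 0xE9 is "é" in cp1252
    print(convert_metadata(b"a\x81b"))   # 0x81 is unmapped in cp1252
```

In the second call the 0x81 byte comes out as the utf-8 encoding of
U+FFFD, so the original byte cannot be recovered afterwards.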
The advantage of the current "replace" behavior is that you *always* end
up with valid utf-8 data in your commit text. The disadvantage is that
you can suffer (minor) unrecoverable data loss.

I think your proposal is that (optionally?) the fallback-or-raw behavior
would simply spit out / leave the original bytes in this situation, not
making any attempt to interpret them or convert them to utf-8, but
simply bring them to git as-is. This would make that particular commit
message "invalidly encoded", in the sense that any git client or
git-caller would need to "do what it can" with that sequence of bytes.

This tradeoff between avoidance of data loss and type/encoding coherence
is not one I'm comfortable deciding on. I would ideally prefer a third
route, where the data *is* interpreted and converted, but in a fully
reversible way.

What would you think of a scheme where, *if* the fallback encoding fails
to decode successfully, we simply take all x80+ bytes and escape them to
a form like "\x8c", so a commit message might end up like "solve the
problem with the Japanese character: \x8f\xc4."? Would this way of
preserving the bytes, without breaking out of having a known (utf-8)
encoding in the git commits, make sense to you? We could even add a
suffix to the message like "[original data contained bytes that could
not be mapped in targeted encoding cp1252; bytes at or over x80 were
escaped as \xNN, and backslashes were escaped as \\ to avoid
ambiguity]", or something less horrendously verbose :)

>
> I haven't worked with p4, but having done some legacy SCM imports it was
> nice to be able to map data 1=1 to git in those "odd encoding" cases, or
> even cases where there was raw binary in a commit message or whatever...
>

My problem here, again, is that these "badly encoded commit messages"
endure forever in your commit history: any tool that wants to parse your
history will have to deal with them, skirt around them, etc. That just
seems like bad discipline.
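
The reversible escaping floated above could be sketched roughly like
this in Python. The names escape_bytes and unescape_bytes are
hypothetical and nothing like them exists in git-p4 today; the sketch
simply assumes that every byte at or over x80 becomes \xNN and every
backslash becomes \\, so the result is plain ASCII (and therefore valid
utf-8).

```python
import re


def escape_bytes(raw: bytes) -> str:
    """Hypothetical: escape high bytes and backslashes into ASCII text."""
    out = []
    for b in raw:
        if b == 0x5C:                 # backslash, doubled to avoid ambiguity
            out.append("\\\\")
        elif b >= 0x80:               # any non-ASCII byte becomes \xNN
            out.append("\\x%02x" % b)
        else:
            out.append(chr(b))
    return "".join(out)


# Matches either an escaped backslash or a \xNN escape, scanning left to right.
_ESCAPE = re.compile(r"\\(\\|x[0-9a-f]{2})")


def unescape_bytes(text: str) -> bytes:
    """Hypothetical inverse of escape_bytes: recover the original bytes."""
    def repl(match):
        token = match.group(1)
        if token == "\\":
            return "\\"
        return chr(int(token[1:], 16))
    # After substitution every character is below U+0100, so latin-1
    # maps each one back to exactly one byte.
    return _ESCAPE.sub(repl, text).encode("latin-1")


if __name__ == "__main__":
    original = b"japanese character: \x8f\xc4."
    escaped = escape_bytes(original)
    print(escaped)
    print(unescape_bytes(escaped) == original)
```

Doubling the backslash is what keeps the round trip exact: a message
that already contained the literal text \x41 stays distinguishable from
an escaped byte.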
If the intent is to ensure that you *can* reconstruct those bytes
if/when you need to, we should find a way to store them safely, in a way
that won't randomly trip others up.

> Anyway, all of this is fine with me as-is, I just had a drive-by
> question after some admittedly not very careful reading, sorry.

Thanks for looking into it!