On Wed, Dec 14, 2022 at 7:11 AM Tzadik Vanderhoof <tzadik.vanderhoof@xxxxxxxxx> wrote: > > From: Tao Klerks <tao@xxxxxxxxxx> > > Fix the handling of utf8-with-BOM files when importing changes from > > p4 to git, and introduce a test that checks it is working correctly > > I don't really understand the problem that this patch addresses, but I > feel it was incorrect to modify the behavior of git-p4 in this way. > This change has made git-p4 behave differently than the native > Perforce tools. Hi Tzadik, first of all my apologies for the fact that these changes broke your flow - it was clearly not the intent to cause any users to have to change the way they do things (or, worse, make it impossible for them to work, or cause BOM-presence inconsistencies in their repos). When you say "This change has made git-p4 behave differently than the native Perforce tools", I suspect there's a misunderstanding, but you've clearly uncovered a git-p4 "workflow bug" that this change introduced. The git-p4 tooling can be and is used for a number of different things. If I understand your usecase correctly, you keep a local git repo that you work in, and others around you continue to use p4; you (presumably) don't share your git repo with any other users. Under those single-user circumstances, it makes sense that you expect your git repo to contain "your workspace's understanding" of the perforce repo/contents, rather than "the actual contents of the repo" - and in fact, if I understand correctly, the fact that your workspace has a specific view on what's in the repo (does UTF-8 BOM stripping) means that the git-p4 "submit"/"commit" command finds unexpected differences between your git files and your p4 workspace files, when trying to submit git changes into your p4 workspace. In other use-cases where the git repo is shared between multiple users, or where a perforce-to-git migration is facilitated by this tooling, it is critical that the result of importing a perforce repo/branch into git be "faithful", retaining the actual contents of UTF-8-BOM files - including those first 3 bytes that are ignored by many text-file-processing systems, but significant to others. > > Perforce already has a "variable" (setting) to control this behavior. > If P4CHARSET is set to "utf8-bom", the BOM will be added to files in > the local workspace of type "utf8". If P4CHARSET is set to "utf8", > the BOM will NOT be stored in the local workspace file, even if its > type is "utf8"! This behavior makes sense for workstation tooling like the "p4" tool, but to the best of my knowledge it does not make sense for "repo cloning" tooling. Unless I'm misunderstanding something, git-p4 was working as you wanted before these changes, precisely *because* you had specified P4CHARSET=utf8 in your workspace (and you're the only one using your git repo, or all others set P4CHARSET the same way and have the same BOMless perspective in their worktrees), so git-p4's buggy clone/sync behavior (swallowing UTF-8 BOMs when importing repo history from perforce) matched your p4 workspace's intentionally configured behavior. There is a complicating factor here that may be clouding my understanding: Perforce has two server modes of operation, "unicode enabled server" and... non-unicode-enabled? Normal? The idea that there are utf-8 files stored without BOM (and stored, in fact, as type "text"), and utf-8 files stored with BOM (as type "utf8") is a notion of the not-unicode-enabled Perforce servers. It may be that no such distinction exists / can be stored in "unicode enabled servers" - I need to do some testing with such servers to better understand how this works. As far as I understand, you must be using a "unicode enabled server" as the p4 docs say P4CHARSET must always be unset / set to "none" on "normal" servers (and must always be set, with "unicode enabled servers", even if implicitly to "auto"). > > Whatever the problem was that motivated this change, it should have > been solved using the Perforce variable P4CHARSET, not by modifying > the behavior of git-p4. Obviously if the conflict with workflows like yours had been understood, the change would not have been made as it was. However, I'm not sure that I understand your proposal here to solve a problem "using the Perforce variable P4CHARSET". The problem was that, on standard Perforce servers, when you add a UTF-8-BOM file, it gets stored with type "utf8". When someone else syncs that file, they get back what was put in - a UTF-8-BOM file. This is the way you generally expect/hope a VCS will typically work. Conversely, if you add a UTF-8 file that does *not* have the BOM, it will be stored as type "text" - and someone syncing it will, again, get out what was put in. This is perforce client tooling working correctly/normally. git-p4, on the other hand, does not use "p4 sync", it uses "p4 print", and with this command, the BOM is *not* included on "utf8" type files. This is not dependent on P4CHARSET, it's just an aspect of the contract of "p4 print" - it expects the recipient of text-derived files to parse the p4 "type" and handle any encoding transformations (like adding a BOM) accordingly. Again, I'm not attempting to defend the breakage - just outlining why I don't see how "using the Perforce variable P4CHARSET" would solve anything. > This new behavior has made it impossible for > me to submit changes to files of type "utf8"! Any attempt fails with > "patch does not apply" and the erroneously added BOM is the cause. I will try to understand the "unicode enabled server" behavior today or tomorrow and see what options might make sense. > I propose rolling back the patch that introduced this behavior, Junio is the expert here and has noted it's a little late for that. I obviously defer to his expertise as to git's release and backout strategy. I would like to have a go at understanding what the options are (how we can get correct and functional behavior for all users), before proposing a specific course of action. > and if > that is not possible, I would like to submit a patch that adds a new > setting in the "git-p4" section that will enable users to opt out of > this new behavior (for example a boolean setting "git-p4.noUtf8Bom"). I suspect that is the easiest tactical solution, but I'd be concerned that this would require users like you (on such unicode-enabled servers, with the P4CHARSET=utf8 setting locally), to "discover" the cause of submission errors they get, and discover the setting that would enable them to work around it. At least in the medium term, I would prefer to find a way to have git-p4 intentionally respect the "P4CHARSET" setting (with a warning about repo faithfulness if you're losing any data or causing potential repo inconsistency along the way). As a workaround for you in the immediate term (and sorry, I feel super-dirty even writing this) the only option I see for "git-p4 submit" to work would be for you to, when this happens, temporarily change your P4CHARSET value to "utf8-bom" and re-sync the affected files in your workspace, before you "git-p4 submit". I have not tested this, but given my limited understanding of your situation it seems like this would work. If you already have a different workaround please let us know what it is! Best regards, Tao