We've started getting this quite a lot as we switched to a new P4 server and I suspect that the i18N options are incorrect. So I think this would be welcome. A test case would be useful, as debugging these decoding problems is a bit of a nightmare. Luke On Wed, 3 Feb 2021 at 22:42, Feiyang Xue <me@xxxxxxxxxxxxxx> wrote: > > > > On Feb 3, 2021, at 3:44 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote: > > "Feiyang via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > > From: Feiynag Xue <fxue@xxxxxxxx> > > P4 allows non-unicode characters in changelist description body, > so git-p4 needs to be character encoding aware when reading p4 cl > > This change adds 2 config options, one specifies encoding, > the other specifies erro handling upon unrecognized character. > Those configs apply when it reads p4 description text, mostly > from commands "p4 describe" and "p4 changes". > > Signed-off-by: Feiynag Xue <fxue@xxxxxxxx> > --- > > > Adding a few people who had meaningful (read: needs some Perforce > knowledge) changes to this part of the codebase to Cc: to ask for > their reviews. > > > Adding Yang Zhao <yang.zhao@xxxxxxxxxxxxxx>, who had made character > encodings related changes for paths. > > Adding Scott Lamb <slamb@xxxxxxxxx>, who had made changes to this > “p4CmdList()” method. > > > > git-p4: handle non-unicode characters in p4 changelist description > > P4 allows non-unicode characters in changelist description body, so > git-p4 needs to be character encoding aware when reading p4 cl. > > This change adds 2 config options: one specifies encoding, the other > specifies erro handling upon unrecognized character. Those configs apply > when it reads p4 description text, mostly from commands "p4 describe" > and "p4 changes". > > ------------------------------------------------------------------------ > > I have an open question in mind: what might be the best default config > to use? > > Currently the python's bytes.decode() is called with default utf-8 and > strict error handling, so git-p4 pukes on non-unicode characters. I > encountered it when git p4 sync attempts to ingest a certain CL. > > It seems to make sense to default to replace so that it gets rid of > non-unicode chars while trying to retain information. However, i am > uncertain on if we have use cases where it relies on the > stop-on-non-unicode behavior. (Hypothetically say an automation that's > expected to return error on non-unicode char in order to stop them from > propagating further?) > > ------------------------------------------------------------------------ > > I tested it with git p4 sync to a P4 CL that somehow has non-unicode > control character in description. With > git-p4.cldescencodingerrhandling=ignore, it proceeded without error. > > Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-864%2Ffeiyeung%2Fdescription-text-encoding-handling-v1 > Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1 > Pull-Request: https://github.com/gitgitgadget/git/pull/864 > > Documentation/git-p4.txt | 13 +++++++++++++ > git-p4.py | 12 +++++++++++- > 2 files changed, 24 insertions(+), 1 deletion(-) > > diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt > index f89e68b424c..01a0e0b1067 100644 > --- a/Documentation/git-p4.txt > +++ b/Documentation/git-p4.txt > @@ -638,6 +638,19 @@ git-p4.pathEncoding:: > to transcode the paths to UTF-8. As an example, Perforce on Windows > often uses "cp1252" to encode path names. > > +git-p4.clDescEncoding:: > + Perforce allows non-unicode characters in changelist description. > + Use this config to tell git-p4 what encoding Perforce had used for > + description text. This encoding is used to transcode the text to > + UTF-8. Defaults to "utf_8". > > > Would it still work if you replaced "utf_8" here with "UTF-8"? If > we can use "UTF-8", this description (and the code that does so) > would read much less awkward, I would think. > > > I doubt “UTF-8” would work; I do believe the lower case “utf-8” would. > > Looking at Python3 documentation on encodings, UTF-8 is specified as “utf_8”. > It allows aliases of using dash to replace underscore, as pointed out by the > samge page: https://docs.python.org/3/library/codecs.html#standard-encodings > > Notice that spelling alternatives that only differ in case or use a hyphen > > instead of an underscore are also valid aliases; therefore, e.g. 'utf-8’ > > is a valid alias for the 'utf_8' codec. > > I used underscore one “utf_8” for consistency reason: this file already has > uses of “utf_8”. > > > > +git-p4.clDescNonUnicodeHandling:: > + Perforce allows non-unicode characters in changelist description. > + Use this config to tell git-p4 what to do when it does not recognize > + the character encoding in description body. Defaults to "strict" for > + stopping upon encounter. "ignore" for skipping unrecognized > + characters; "replace" for attempting to convert into UTF-8. > + > git-p4.largeFileSystem:: > Specify the system that is used for large (binary) files. Please note > that large file systems do not support the 'git p4 submit' command. > diff --git a/git-p4.py b/git-p4.py > index 09c9e93ac40..abbeb9156bd 100755 > --- a/git-p4.py > +++ b/git-p4.py > @@ -206,6 +206,13 @@ def decode_path(path): > print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path)) > return path > > +def decode_changlist_description(text): > + """Decode bytes or bytearray using configured changelist description encoding options > + """ > + encoding = gitConfig('git-p4.clDescEncoding') or 'utf_8' > + err_handling = gitConfig('git-p4.clDescEncodingErrHandling') or 'strict' > + return text.decode(encoding, err_handling) > + > def run_git_hook(cmd, param=[]): > """Execute a hook if the hook exists.""" > if verbose: > @@ -771,7 +778,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False, > for key, value in entry.items(): > key = key.decode() > if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')): > - value = value.decode() > + if key == 'desc': > + value = decode_changlist_description(value) > + else: > + value = value.decode() > decoded_entry[key] = value > # Parse out data if it's an error response > if decoded_entry.get('code') == 'error' and 'data' in decoded_entry: > > base-commit: e6362826a0409539642a5738db61827e5978e2e4 > >