Re: [PATCH] git-p4: handle non-unicode characters in p4 cl

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"Feiyang via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> From: Feiynag Xue <fxue@xxxxxxxx>
>
> P4 allows non-unicode characters in changelist description body,
> so git-p4 needs to be character encoding aware when reading p4 cl
>
> This change adds 2 config options, one specifies encoding,
> the other specifies erro handling upon unrecognized character.
> Those configs  apply when it reads p4 description text, mostly
> from commands "p4 describe" and "p4 changes".
>
> Signed-off-by: Feiynag Xue <fxue@xxxxxxxx>
> ---

Adding a few people who had meaningful (read: needs some Perforce
knowledge) changes to this part of the codebase to Cc: to ask for
their reviews.

>     git-p4: handle non-unicode characters in p4 changelist description
>     
>     P4 allows non-unicode characters in changelist description body, so
>     git-p4 needs to be character encoding aware when reading p4 cl.
>     
>     This change adds 2 config options: one specifies encoding, the other
>     specifies erro handling upon unrecognized character. Those configs apply
>     when it reads p4 description text, mostly from commands "p4 describe"
>     and "p4 changes".
>     
>     ------------------------------------------------------------------------
>     
>     I have an open question in mind: what might be the best default config
>     to use?
>     
>     Currently the python's bytes.decode() is called with default utf-8 and
>     strict error handling, so git-p4 pukes on non-unicode characters. I
>     encountered it when git p4 sync attempts to ingest a certain CL.
>     
>     It seems to make sense to default to replace so that it gets rid of
>     non-unicode chars while trying to retain information. However, i am
>     uncertain on if we have use cases where it relies on the
>     stop-on-non-unicode behavior. (Hypothetically say an automation that's
>     expected to return error on non-unicode char in order to stop them from
>     propagating further?)
>     
>     ------------------------------------------------------------------------
>     
>     I tested it with git p4 sync to a P4 CL that somehow has non-unicode
>     control character in description. With
>     git-p4.cldescencodingerrhandling=ignore, it proceeded without error.
>
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-864%2Ffeiyeung%2Fdescription-text-encoding-handling-v1
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-864/feiyeung/description-text-encoding-handling-v1
> Pull-Request: https://github.com/gitgitgadget/git/pull/864
>
>  Documentation/git-p4.txt | 13 +++++++++++++
>  git-p4.py                | 12 +++++++++++-
>  2 files changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/git-p4.txt b/Documentation/git-p4.txt
> index f89e68b424c..01a0e0b1067 100644
> --- a/Documentation/git-p4.txt
> +++ b/Documentation/git-p4.txt
> @@ -638,6 +638,19 @@ git-p4.pathEncoding::
>  	to transcode the paths to UTF-8. As an example, Perforce on Windows
>  	often uses "cp1252" to encode path names.
>  
> +git-p4.clDescEncoding::
> +	Perforce allows non-unicode characters in changelist description. 
> +	Use this config to tell git-p4 what encoding Perforce had used for 
> +	description text. This encoding is used to transcode the text to
> +	UTF-8. Defaults to "utf_8".

Would it still work if you replaced "utf_8" here with "UTF-8"?  If
we can use "UTF-8", this description (and the code that does so)
would read much less awkward, I would think.

> +git-p4.clDescNonUnicodeHandling::
> +	Perforce allows non-unicode characters in changelist description. 
> +	Use this config to tell git-p4 what to do when it does not recognize 
> +	the character encoding in description body. Defaults to "strict" for 
> +	stopping upon encounter. "ignore" for skipping unrecognized
> +	characters; "replace" for attempting to convert into UTF-8. 
> +
>  git-p4.largeFileSystem::
>  	Specify the system that is used for large (binary) files. Please note
>  	that large file systems do not support the 'git p4 submit' command.
> diff --git a/git-p4.py b/git-p4.py
> index 09c9e93ac40..abbeb9156bd 100755
> --- a/git-p4.py
> +++ b/git-p4.py
> @@ -206,6 +206,13 @@ def decode_path(path):
>                  print('Path with non-ASCII characters detected. Used {} to decode: {}'.format(encoding, path))
>          return path
>  
> +def decode_changlist_description(text):
> +    """Decode bytes or bytearray using configured changelist description encoding options
> +    """
> +    encoding = gitConfig('git-p4.clDescEncoding') or 'utf_8'
> +    err_handling = gitConfig('git-p4.clDescEncodingErrHandling') or 'strict'
> +    return text.decode(encoding, err_handling)
> +
>  def run_git_hook(cmd, param=[]):
>      """Execute a hook if the hook exists."""
>      if verbose:
> @@ -771,7 +778,10 @@ def p4CmdList(cmd, stdin=None, stdin_mode='w+b', cb=None, skip_info=False,
>                  for key, value in entry.items():
>                      key = key.decode()
>                      if isinstance(value, bytes) and not (key in ('data', 'path', 'clientFile') or key.startswith('depotFile')):
> -                        value = value.decode()
> +                        if key == 'desc':
> +                            value = decode_changlist_description(value)
> +                        else:
> +                            value = value.decode()
>                      decoded_entry[key] = value
>                  # Parse out data if it's an error response
>                  if decoded_entry.get('code') == 'error' and 'data' in decoded_entry:
>
> base-commit: e6362826a0409539642a5738db61827e5978e2e4



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux