Re: Feature request: better error messages when UTF-8 bites

On 2022-07-28 01:42, Johannes Sixt wrote:
On 27.07.22 at 22:21, CH wrote:
Somehow when copying and pasting a commit from a website to the command
line, a UTF-8 Byte Order Mark (BOM)
[https://en.wikipedia.org/wiki/Byte_order_mark] was appended to one of
the commit ids.  BOMs are invisible, as are many other UTF-8 code
points.  The upshot was that Git didn't like it, and complained bitterly:

$ strace -etrace=execve -s 200 git diff
038179704f0066aa815d5429221cf381ff4ef289
47346a462d8ba40b9a8b073e351c362522c46aa6

execve("/usr/bin/git", ["git", "diff",
"038179704f0066aa815d5429221cf381ff4ef289\357\273\277",
"47346a462d8ba40b9a8b073e351c362522c46aa6"], 0x7fffec3c4bb0 /* 80 vars
*/) = 0

fatal: ambiguous argument '038179704f0066aa815d5429221cf381ff4ef289':
unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
+++ exited with 128 +++
Feature request:
================

When printing the "fatal: ambiguous argument '......': ....", perhaps
escape (url or otherwise) the ambiguous argument when printing it in the
error message, or maybe add a sentence about non-ASCII characters being
found.
That's not going to fly, IMHO, because when I type

    git diff todo/René

I would not want to see

fatal: ambiguous argument 'todo/Ren\303\251': unknown ...

This is actually already MUCH better than the OP's example. In his example he has a string that looks like a 40-char hash, and git complains without showing any of the Unicode gibberish attached to that SHA-1. It would be better if the error message at least printed the argument in ASCII with the non-ASCII bytes escaped.
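To make the idea concrete, here is a minimal sketch of that kind of escaping. It is not how git formats its messages; escape_non_ascii is a hypothetical helper, and the octal style is only chosen to match the strace output quoted above.

```python
def escape_non_ascii(arg: bytes) -> str:
    """Render raw argument bytes as printable ASCII, octal-escaping the rest.

    Hypothetical helper for illustration only; not part of git.
    """
    out = []
    for b in arg:
        # Keep printable ASCII as-is, except the backslash, which is escaped
        # too so that the output stays unambiguous.
        if 0x20 <= b <= 0x7E and b != 0x5C:
            out.append(chr(b))
        else:
            out.append("\\%03o" % b)
    return "".join(out)


if __name__ == "__main__":
    # The commit id from the report, with the UTF-8 BOM bytes still attached.
    pasted = b"038179704f0066aa815d5429221cf381ff4ef289\xef\xbb\xbf"
    print("fatal: ambiguous argument '%s'" % escape_non_ascii(pasted))
```

With the BOM attached, the printed argument ends in \357\273\277, so the user can at least see that something follows the hash.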

Moreover this isn't even close to the issue above - he's talking about a no-op, non-printing Unicode marker that crept in. While I do think it shouldn't be an issue, it shouldn't even have been passed to git. IMHO it should have been stripped by the browser itself on copy, or by the terminal on paste... FWIW I'm using rxvt-unicode, and copying this from the terminal doesn't copy the marker but pasting the marker copied from Chrome is passed on to bash and git.
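Since the marker is a non-printing code point rather than ordinary non-ASCII text, a tool could flag only that class of character and leave names like René alone. A rough sketch, assuming the argument has already been decoded as UTF-8; invisible_code_points is a made-up name, and checking for the Unicode "Cf" (format) category is my assumption about what counts as invisible:

```python
import unicodedata


def invisible_code_points(text: str):
    """Return (index, code point, name) for every format character in text.

    Hypothetical helper: Unicode general category "Cf" covers the BOM
    (U+FEFF), zero-width joiners, and similar non-printing characters.
    """
    hits = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            hits.append((index, "U+%04X" % ord(ch), name))
    return hits


if __name__ == "__main__":
    # The BOM at the end is reported; the accented letter in René is not.
    print(invisible_code_points("038179704f\ufeff"))
    print(invisible_code_points("todo/Ren\u00e9"))
```

A hint built from such a list could be appended to the error only when it is non-empty, so ordinary accented names would print unchanged.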

NB: I also thought about having the shell handle it, but the marker isn't really a character, so it's not technically suitable for $IFS. Even if we considered that option, it wouldn't play well with POSIX's definition of $IFS: how would the shell tell, for example, a single Unicode code point from a list of single-byte characters? There is just no definition of wide chars for $IFS, neither in POSIX nor in recent versions of Bash AFAIK.


TL;DR: the issue is IMHO on the browser side, which shouldn't include the marker in the copied text, or maybe on the terminal side. BUT when the marker is passed on to git, git should at least print the escaped Unicode chars in the error; otherwise it's just too confusing for the user.


BTW you actually raise another issue: I do think that for file paths git could either recompose (NFC) or decompose (NFD) the strings on storage and comparison, which should probably be an option. The current default in 2.30.2 is to treat them as binary and to escape them on print. Consider the following when using core.quotePath=false:

$ touch "nfc_$(printf '\xc3\xb4')"
$ touch "nfd_$(printf '\x6f\xcc\x82')"
$ git add nf[cd]*
$ git status
On branch test
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
    new file:   nfc_ô
    new file:   nfd_ô

I'm not sure how the Unicode will be rendered here; it might depend on the mail client, if the characters even get sent as-is. But both entries show the exact same file name, one in NFC and one in NFD form.

Both are canonically equivalent and reversible. It appears macOS already decomposes (NFD?) filenames by default, and git provides an option to recompose the characters (core.precomposeUnicode) which, according to the manual, is not even usable on Linux...
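For anyone who wants to check the equivalence, here is a small sketch using Python's standard unicodedata module. It only illustrates the two forms of ô and a normalized comparison; same_name is a hypothetical helper and says nothing about how git compares paths internally.

```python
import unicodedata

# Precomposed form: a single code point, U+00F4.
nfc = unicodedata.normalize("NFC", "o\u0302")
# Decomposed form: "o" followed by U+0302 COMBINING CIRCUMFLEX ACCENT.
nfd = unicodedata.normalize("NFD", "\u00f4")


def same_name(a: str, b: str) -> bool:
    """Compare two names after bringing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    print(nfc.encode("utf-8"))    # UTF-8 bytes of the precomposed form
    print(nfd.encode("utf-8"))    # UTF-8 bytes of the decomposed form
    print(nfc == nfd)             # a plain comparison sees different strings
    print(same_name(nfc, nfd))    # a normalized comparison treats them as equal
```

The precomposed form encodes to the two bytes c3 b4 in UTF-8 and the decomposed form to 6f cc 82, which is why a byte-wise comparison sees two different names.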

More on Unicode normalization: https://unicode.org/reports/tr15/

--
Thomas


