Re: [PATCH V4] git on Mac OS and precomposed unicode

[Pinging Nguyen who has worked rather extensively on the start-up sequence
for ideas.]

Torsten Bögershausen <tboegi@xxxxxx> writes:

I'll try to reword the log message a bit below.

> When a file called "LATIN CAPITAL LETTER A WITH DIAERESIS" (in utf-8
> encoded as 0xc3 0x84) is created, the Mac OS filesystem converts
> "precomposed unicode" into "decomposed unicode". readdir() will return
> 0x41 0xcc 0x88 for such a file, which does not match what the caller
> thought it created.
>
> To work around this braindamage, allow git on Mac OS to optionally use a
> wrapper for readdir() that converts decomposed unicode back into the
> precomposed form, which most other platforms use natively. This makes it
> easier for Mac OS users to work together on the same project with people
> on other platforms (Note that not all Windows versions support UTF-8
> yet. Msysgit needs the unicode branch, cygwin supports UTF-8 since
> 1.7). It also allows sharing git repositories stored on a VFAT file
> system (e.g. a USB stick) or on a network share mounted via Samba.
>
> This new feature is controlled by setting a new configuration variable
> "core.precomposedunicode" to "true". Unless the variable is set to "true",
> Git on Mac OS behaves exactly as before, for backward compatibility.
>
> The code in compat/precomposed_utf8.c essentially implements four new
> functions: precomposed_utf8_opendir(), precomposed_utf8_readdir(),
> precomposed_utf8_closedir() and precompose_argv().
>
> To prevent a file name in decomposed unicode from ever entering the
> index, a "brute force" approach is taken: all arguments to git
> (argv[1]..argv[n]) are converted into precomposed unicode.  This is
> done in git.c by calling precompose_argv().  This function is actually a
> #define, and it is only defined under Mac OS.  Nothing is converted on
> any other platforms.

It may be just me, but the above looks more in line with the usual style
of writing in our existing log messages.

Is this UTF-8 decomposition only an issue on HFS+, or does it happen on
any filesystem mounted on a MacOS box? If the former, then the second line
of the first paragraph needs further rephrasing, e.g. "... is created,
HFS+, the primary filesystem on the Mac OS, converts ...".
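
To make the mismatch from the first paragraph concrete, here is a minimal
sketch in C. The two byte sequences are the ones quoted in the log message;
the array names and the helper names_match() are hypothetical and only stand
in for a caller that compares pathnames byte by byte.

```c
#include <string.h>

/* U+00C4 LATIN CAPITAL LETTER A WITH DIAERESIS in UTF-8 (precomposed). */
static const char precomposed_name[] = "\xc3\x84";

/* "A" followed by U+0308 COMBINING DIAERESIS in UTF-8 (decomposed). */
static const char decomposed_name[] = "\x41\xcc\x88";

/*
 * Hypothetical helper: a plain byte-wise comparison, which is what a
 * caller does when it matches a readdir() result against the name it
 * passed to creat(). It knows nothing about unicode normalization.
 */
static int names_match(const char *a, const char *b)
{
	return strcmp(a, b) == 0;
}
```

Both strings render as the same character, yet names_match() reports them
as different, which is exactly why the name returned by readdir() does not
match what the caller thought it created.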

> Auto sensing:
> When creating a new git repository with "git init" or "git clone",
> "core.precomposedunicode" will be set to "false".
>
> The user needs to activate this feature manually.
> She typically sets core.precomposedunicode to "true" on HFS and VFAT,
> or file systems mounted via SAMBA onto a Linux box.

I am not sure about this design decision.

I agree that it is prudent to introduce a new feature disabled by default,
and I can understand that you tried to make the feature more discoverable
by setting it explicitly to "false".

But I do not think it is a good idea. If a user is on MacOS and has only
HFS+, then it would be more convenient to have the configuration set to
true in $HOME/.gitconfig once and for all, to affect all repositories on
the box. "git init" dropping an explicit "false" into every new repository
defeats that.

Wouldn't it make more sense if your "git init" did it this way?

    * Do not do anything, if you know core.precomposedunicode is already
      set (in /etc/gitconfig or $HOME/.gitconfig);

    * Otherwise, if the "probe" says "yes, we are on HFS+", issue an
      advice message suggesting that the user set it either in the
      repository-specific .git/config or in the $HOME/.gitconfig file.
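
The two bullets above amount to a small decision table. A minimal sketch in
C follows; the enum, the function name and both parameters are hypothetical
names invented for illustration (none of them is in the patch), and the
result of the "probe" is simply passed in as a flag rather than computed.

```c
/* Hypothetical outcomes for "git init" / "git clone". */
enum init_action {
	INIT_DO_NOTHING,	/* leave the configuration untouched */
	INIT_ADVISE_USER	/* print advice; still write nothing */
};

/*
 * already_configured: non-zero if core.precomposedunicode is set in
 *                     /etc/gitconfig or $HOME/.gitconfig.
 * probe_says_hfs_plus: non-zero if the filesystem probe detected
 *                      decomposition (assumed to be done elsewhere).
 */
static enum init_action precompose_init_action(int already_configured,
						int probe_says_hfs_plus)
{
	if (already_configured)
		return INIT_DO_NOTHING;	/* respect the user's choice */
	if (probe_says_hfs_plus)
		return INIT_ADVISE_USER;
	return INIT_DO_NOTHING;
}
```

Note that neither branch writes "false" into the new repository, so a
setting made once in $HOME/.gitconfig keeps applying to every repository
on the box.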

> +core.precomposedunicode::
> +	This option is only used by the Mac OS implementation of git.
> +	When core.precomposedunicode=true,
> +	git reverts the unicode decomposition of filenames done by Mac OS.
> +	This is useful when pulling/pushing from repositories containing utf-8
> +	encoded filenames using precomposed unicode (like Linux).

I would imagine that if the caller of creat(2) named the path in the
decomposed form, Mac OS would store it unaltered; strictly speaking, we
shouldn't say "reverts". How about:

    When set to true, pathnames in decomposed UTF-8 read from the
    filesystem are converted to precomposed UTF-8 before they are used by
    Git, to improve interoperability with other platforms.

> +void precompose_argv(int argc, const char **argv)
> +{
> +	int i = 0;
> +	const char *oldarg;
> +	char *newarg;
> +	iconv_t ic_precompose;
> +
> +	git_config(precomposed_unicode_config, NULL);

Since this is the first thing called after main(), I still doubt that it is
a safe thing to do (Pinging Nguyen who has worked rather extensively on the
start-up sequence for ideas). This is ifdefed away and will not break things
on other platforms, which may make it even harder to diagnose breakages.
