Re: [RFC/PATCH v5] git on Mac OS and precomposed unicode

Junio C Hamano <gitster@xxxxxxxxx> · Mon, 25 Jun 2012 14:24:29 -0700

Torsten Bögershausen <tboegi@xxxxxx> writes:

> Knowing that Mac OS X writes file names as precomposed to disk,
> and treats precomposed and decomposed file names as equal, git under Mac OS X
> can be improved to revert the unicode decomposition of file names.

The basic idea looks really good, but there are a few nits.

Thanks.

> The user needs to activate this feature manually.
> She typically sets core.precomposedunicode to "true" on HFS and VFAT,
> or file systems mounted via SAMBA onto a Linux box.

What do you mean by the last part?  A Linux box that mounts a
filesystem from MacOS X box via cifs?  Or MacOS X mounts a
filesystem from a Linux box over cifs?

> +core.precomposedunicode::
> +	This option is only used by Mac OS implementation of git.
> +	When core.precomposedunicode=true, git reverts the unicode decomposition
> +	of filenames done by Mac OS. This is useful when sharing a repository
> +	between Mac OS and Linux or	Windows.

Why this funny inter-word spacing?

> +	(Git for Windows/msysGit 1.7.10 is needed, or git under cygwin 1.7).
> +	When false, file names are handled fully transparent by git, which means
> +	that file names are stored as decomposed unicode in the	repository.

I am assuming, after reading from this section, that the answer to
my earlier question is "MacOS X that mounts over cifs from whatever
file server" (i.e. the latter).  

It often is a good idea, after writing "X , which means that Y", if
you can just drop X and go strait to Y.  The result often becomes
much more clear.

As this section is clearly labeled as "Mac OS only", I think in this
case it is OK to say "when false, file names are always stored as
decomposed unicode..." for now.

Given that your compat/precomposed_utf8.[ch] looks like it does not
use anything MacOS X specific (other than "UTF-8-MAC" given to
iconv(3) API), do you foresee that this might become useful on non
Mac build of Git in the future?

> diff --git a/builtin/init-db.c b/builtin/init-db.c
> index 0dacb8b..06953df 100644
> --- a/builtin/init-db.c
> +++ b/builtin/init-db.c
> @@ -290,6 +290,8 @@ static int create_default_files(const char *template_path)
>  		strcpy(path + len, "CoNfIg");
>  		if (!access(path, F_OK))
>  			git_config_set("core.ignorecase", "true");
> +
> +		probe_utf8_pathname_composition(path, len);
>  	}

Probing for case-insensitiveness and probing for UTF-8 mangling are
logically related and do not deserve a separation with the extra
blank line.

> diff --git a/compat/precomposed_utf8.c b/compat/precomposed_utf8.c
> new file mode 100644
> index 0000000..f510f21
> --- /dev/null
> +++ b/compat/precomposed_utf8.c
> @@ -0,0 +1,200 @@
> +#define __PRECOMPOSED_UNICODE_C__
> +
> +#include "../cache.h"
> +#include "../utf8.h"

Do you really need "../"?  I thought we compiled with -I.. from the
Makefile.

> +#include <stdlib.h>
> +#include <string.h>
> +#include <stdio.h>
> +#include <stdint.h>

And as you are including "cache.h" (which in turn includes
"git-compat-util.h"), I doubt you would need to include these
yourself, and if you do not need them, I would prefer not to see
them.

> +#include "precomposed_utf8.h"
> +
> +const static char *repo_encoding = "UTF-8";
> +const static char *path_encoding = "UTF-8-MAC";
> +
> +
> +/* Code borrowed from utf8.c */
> +#if defined(OLD_ICONV) || (defined(__sun__) && !defined(_XPG6))
> +	typedef const char * iconv_ibp;
> +#else
> +	typedef char * iconv_ibp;
> +#endif

Seeing "defined(__sun__)" here, is this not for Mac OS X-only after
all?  If so, I'd have to ask you to come up with a better name than
"mac_os_precomposed_unicode". It is primarily about sanitizing
decomposed utf-8 and it only is a happenstance that Mac OS X is the
most likely platform people may need this feature on (iow, "mac_os"
is not the fundamental part of this issue; having to sanitize paths
of decomposed UTF-8 is).

> +static char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv)
> +{
> +	size_t outsz, outalloc;
> +	char *out, *outpos;
> +	iconv_ibp cp;
> +
> +	outsz = insz;
> +	outalloc = outsz + 1; /* for terminating NUL */
> +	out = xmalloc(outalloc);
> +	outpos = out;
> +	cp = (iconv_ibp)in;
> +
> +	while (1) {
> +	...
> +	}
> +	return out;
> +}

Shouldn't this part be using a new helper function refactored out of
the utf8.c::reencode_string() function, instead of cutting and
pasting?

> +static size_t has_utf8(const char *s, size_t maxlen, size_t *strlen_c)
> +{
> +	const uint8_t *utf8p = (const uint8_t*) s;
> +	size_t strlen_chars = 0;
> +	size_t ret = 0;
> +
> +	if ((!utf8p) || (!*utf8p))
> +		return 0;
> +
> +	while((*utf8p) && maxlen) {
> +		if (*utf8p & 0x80)
> +			ret++;
> +		strlen_chars++;
> +		utf8p++;
> +		maxlen--;
> +	}
> +	if (strlen_c)
> +		*strlen_c = strlen_chars;
> +
> +	return ret;
> +}
> +
> +
> +void probe_utf8_pathname_composition(char *path, int len)
> +{
> +	const static char *auml_nfc = "\xc3\xa4";
> +	const static char *auml_nfd = "\x61\xcc\x88";
> +	int output_fd;
> +	if (mac_os_precomposed_unicode != -1)
> +		return; /* We found it defined in the global config, respect it */
> +	path[len] = 0;
> +	strcpy(path + len, auml_nfc);
> +	output_fd = open(path, O_CREAT|O_EXCL|O_RDWR, 0600);
> +	if (output_fd >=0) {
> +		close(output_fd);
> +		path[len] = 0;
> +		strcpy(path + len, auml_nfd);
> +		/* Indicate the the user, that we can configure it to true */
> +		if (0 == access(path, R_OK))
> +			git_config_set("core.precomposedunicode", "false");
> +		path[len] = 0;
> +		strcpy(path + len, auml_nfc);
> +		unlink(path);
> +	}

Now this function has figured out if the filesystem mangles composed
UTF-8 pathnames, shouldn't it flip mac_os_precomposed_unicode to
either 0 or 1 before it returns?

> +}
> +
> +
> +void precompose_argv(int argc, const char **argv)
> +{
> +	int i = 0;
> +	const char *oldarg;
> +	char *newarg;
> +	iconv_t ic_precompose;
> +
> +	if (mac_os_precomposed_unicode != 1)
> +		return;

Could the mac_os_precomposed_unicode flag be -1 (unknown) here?
Otherwise just write

	if (!mac_os_precomposed_unicode)
        	return;

> +PRECOMPOSED_UTF_DIR * precomposed_utf8_opendir(const char *dirname)

Asterisk sticks to the identifier, i.e.

    PRECOMPOSED_UTF_DIR *precomposed_utf8_opendir(const char *dirname)

> +{
> +	PRECOMPOSED_UTF_DIR *precomposed_utf8_dir;
> +	precomposed_utf8_dir = xmalloc(sizeof(PRECOMPOSED_UTF_DIR));
> +
> +	precomposed_utf8_dir->dirp = opendir(dirname);
> +	if (!precomposed_utf8_dir->dirp) {
> +		free(precomposed_utf8_dir);
> +		return NULL;
> +	}
> +	precomposed_utf8_dir->ic_precompose = iconv_open(repo_encoding, path_encoding);
> +	if (precomposed_utf8_dir->ic_precompose == (iconv_t) -1) {
> +		closedir(precomposed_utf8_dir->dirp);
> +		free(precomposed_utf8_dir);
> +		return NULL;

Hrm, I wonder what value the "errno" variable should be set to when
this happens.

> +	}
> +
> +	return precomposed_utf8_dir;
> +}
> +
> +struct dirent * precomposed_utf8_readdir(PRECOMPOSED_UTF_DIR *precomposed_utf8_dirp)

Likewise.

> +{
> +	struct dirent *res;
> +	size_t namelen = 0;
> +
> +	res = readdir(precomposed_utf8_dirp->dirp);
> +	if (res && (mac_os_precomposed_unicode == 1) && has_utf8(res->d_name, (size_t)-1, &namelen)) {

Likewise.

> +		int ret_errno = errno;
> +		size_t outsz = sizeof(precomposed_utf8_dirp->dirent_nfc.d_name) - 1; /* one for \0 */

A few issues:

 - Why (-1)?  Don't you need terminating NUL at the end of the
   output anyway?

 - sizeof(d_name) is most likely incorrect (Cf.
   http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/dirent.h.html)

 - Is the reason why it is appropriate to update "res" in-place
   because turning a decomposed name into precomposed form will
   always yield a shorter result?  Is that guaranteed?

> +		/* Copy all data except the name */
> +		memcpy(&precomposed_utf8_dirp->dirent_nfc, res,
> +		       sizeof(precomposed_utf8_dirp->dirent_nfc)-sizeof(precomposed_utf8_dirp->dirent_nfc.d_name));
> +		errno = 0;
> +
> +		cnt = iconv(precomposed_utf8_dirp->ic_precompose, &cp, &insz, &outpos, &outsz);
> +		if (cnt < sizeof(precomposed_utf8_dirp->dirent_nfc.d_name) -1) {

s/-1)/- 1)/;

Can't this iconv() fail and return -1 here, which would be smaller
than the size of the structure minus one?

> +			*outpos = 0;
> +			errno = ret_errno;
> +			return &precomposed_utf8_dirp->dirent_nfc;
> +		}
> +		errno = ret_errno;
> +	}
> +	return res;
> +}

> diff --git a/compat/precomposed_utf8.h b/compat/precomposed_utf8.h
> new file mode 100644
> index 0000000..79e65e7
> --- /dev/null
> +++ b/compat/precomposed_utf8.h
> @@ -0,0 +1,30 @@
> +#ifndef __PRECOMPOSED_UNICODE_H__
> +#include <sys/stat.h>
> +#include <sys/types.h>
> +#include <dirent.h>
> +#include <iconv.h>

The same comments for #include applies here.

> +typedef struct {
> +	iconv_t ic_precompose;
> +	DIR *dirp;
> +	struct dirent dirent_nfc;
> +} PRECOMPOSED_UTF_DIR;

Note that "struct dirent" can be defined with d_name[1]; aren't you
risking the memory after this structure to be overwritten with a
longer name?
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html