Torsten Bögershausen <tboegi@xxxxxx> writes: > Knowing that Mac OS X writes file names as precomposed to disk, > and treats precomposed and decomposed file names as equal, git under Mac OS X > can be improved to revert the unicode decomposition of file names. The basic idea looks really good, but there are a few nits. Thanks. > The user needs to activate this feature manually. > She typically sets core.precomposedunicode to "true" on HFS and VFAT, > or file systems mounted via SAMBA onto a Linux box. What do you mean by the last part? A Linux box that mounts a filesystem from MacOS X box via cifs? Or MacOS X mounts a filesystem from a Linux box over cifs? > +core.precomposedunicode:: > + This option is only used by Mac OS implementation of git. > + When core.precomposedunicode=true, git reverts the unicode decomposition > + of filenames done by Mac OS. This is useful when sharing a repository > + between Mac OS and Linux or Windows. Why this funny inter-word spacing? > + (Git for Windows/msysGit 1.7.10 is needed, or git under cygwin 1.7). > + When false, file names are handled fully transparent by git, which means > + that file names are stored as decomposed unicode in the repository. I am assuming, after reading from this section, that the answer to my earlier question is "MacOS X that mounts over cifs from whatever file server" (i.e. the latter). It often is a good idea, after writing "X , which means that Y", if you can just drop X and go strait to Y. The result often becomes much more clear. As this section is clearly labeled as "Mac OS only", I think in this case it is OK to say "when false, file names are always stored as decomposed unicode..." for now. Given that your compat/precomposed_utf8.[ch] looks like it does not use anything MacOS X specific (other than "UTF-8-MAC" given to iconv(3) API), do you foresee that this might become useful on non Mac build of Git in the future? > diff --git a/builtin/init-db.c b/builtin/init-db.c > index 0dacb8b..06953df 100644 > --- a/builtin/init-db.c > +++ b/builtin/init-db.c > @@ -290,6 +290,8 @@ static int create_default_files(const char *template_path) > strcpy(path + len, "CoNfIg"); > if (!access(path, F_OK)) > git_config_set("core.ignorecase", "true"); > + > + probe_utf8_pathname_composition(path, len); > } Probing for case-insensitiveness and probing for UTF-8 mangling are logically related and do not deserve a separation with the extra blank line. > diff --git a/compat/precomposed_utf8.c b/compat/precomposed_utf8.c > new file mode 100644 > index 0000000..f510f21 > --- /dev/null > +++ b/compat/precomposed_utf8.c > @@ -0,0 +1,200 @@ > +#define __PRECOMPOSED_UNICODE_C__ > + > +#include "../cache.h" > +#include "../utf8.h" Do you really need "../"? I thought we compiled with -I.. from the Makefile. > +#include <stdlib.h> > +#include <string.h> > +#include <stdio.h> > +#include <stdint.h> And as you are including "cache.h" (which in turn includes "git-compat-util.h"), I doubt you would need to include these yourself, and if you do not need them, I would prefer not to see them. > +#include "precomposed_utf8.h" > + > +const static char *repo_encoding = "UTF-8"; > +const static char *path_encoding = "UTF-8-MAC"; > + > + > +/* Code borrowed from utf8.c */ > +#if defined(OLD_ICONV) || (defined(__sun__) && !defined(_XPG6)) > + typedef const char * iconv_ibp; > +#else > + typedef char * iconv_ibp; > +#endif Seeing "defined(__sun__)" here, is this not for Mac OS X-only after all? If so, I'd have to ask you to come up with a better name than "mac_os_precomposed_unicode". It is primarily about sanitizing decomposed utf-8 and it only is a happenstance that Mac OS X is the most likely platform people may need this feature on (iow, "mac_os" is not the fundamental part of this issue; having to sanitize paths of decomposed UTF-8 is). > +static char *reencode_string_iconv(const char *in, size_t insz, iconv_t conv) > +{ > + size_t outsz, outalloc; > + char *out, *outpos; > + iconv_ibp cp; > + > + outsz = insz; > + outalloc = outsz + 1; /* for terminating NUL */ > + out = xmalloc(outalloc); > + outpos = out; > + cp = (iconv_ibp)in; > + > + while (1) { > + ... > + } > + return out; > +} Shouldn't this part be using a new helper function refactored out of the utf8.c::reencode_string() function, instead of cutting and pasting? > +static size_t has_utf8(const char *s, size_t maxlen, size_t *strlen_c) > +{ > + const uint8_t *utf8p = (const uint8_t*) s; > + size_t strlen_chars = 0; > + size_t ret = 0; > + > + if ((!utf8p) || (!*utf8p)) > + return 0; > + > + while((*utf8p) && maxlen) { > + if (*utf8p & 0x80) > + ret++; > + strlen_chars++; > + utf8p++; > + maxlen--; > + } > + if (strlen_c) > + *strlen_c = strlen_chars; > + > + return ret; > +} > + > + > +void probe_utf8_pathname_composition(char *path, int len) > +{ > + const static char *auml_nfc = "\xc3\xa4"; > + const static char *auml_nfd = "\x61\xcc\x88"; > + int output_fd; > + if (mac_os_precomposed_unicode != -1) > + return; /* We found it defined in the global config, respect it */ > + path[len] = 0; > + strcpy(path + len, auml_nfc); > + output_fd = open(path, O_CREAT|O_EXCL|O_RDWR, 0600); > + if (output_fd >=0) { > + close(output_fd); > + path[len] = 0; > + strcpy(path + len, auml_nfd); > + /* Indicate the the user, that we can configure it to true */ > + if (0 == access(path, R_OK)) > + git_config_set("core.precomposedunicode", "false"); > + path[len] = 0; > + strcpy(path + len, auml_nfc); > + unlink(path); > + } Now this function has figured out if the filesystem mangles composed UTF-8 pathnames, shouldn't it flip mac_os_precomposed_unicode to either 0 or 1 before it returns? > +} > + > + > +void precompose_argv(int argc, const char **argv) > +{ > + int i = 0; > + const char *oldarg; > + char *newarg; > + iconv_t ic_precompose; > + > + if (mac_os_precomposed_unicode != 1) > + return; Could the mac_os_precomposed_unicode flag be -1 (unknown) here? Otherwise just write if (!mac_os_precomposed_unicode) return; > +PRECOMPOSED_UTF_DIR * precomposed_utf8_opendir(const char *dirname) Asterisk sticks to the identifier, i.e. PRECOMPOSED_UTF_DIR *precomposed_utf8_opendir(const char *dirname) > +{ > + PRECOMPOSED_UTF_DIR *precomposed_utf8_dir; > + precomposed_utf8_dir = xmalloc(sizeof(PRECOMPOSED_UTF_DIR)); > + > + precomposed_utf8_dir->dirp = opendir(dirname); > + if (!precomposed_utf8_dir->dirp) { > + free(precomposed_utf8_dir); > + return NULL; > + } > + precomposed_utf8_dir->ic_precompose = iconv_open(repo_encoding, path_encoding); > + if (precomposed_utf8_dir->ic_precompose == (iconv_t) -1) { > + closedir(precomposed_utf8_dir->dirp); > + free(precomposed_utf8_dir); > + return NULL; Hrm, I wonder what value the "errno" variable should be set to when this happens. > + } > + > + return precomposed_utf8_dir; > +} > + > +struct dirent * precomposed_utf8_readdir(PRECOMPOSED_UTF_DIR *precomposed_utf8_dirp) Likewise. > +{ > + struct dirent *res; > + size_t namelen = 0; > + > + res = readdir(precomposed_utf8_dirp->dirp); > + if (res && (mac_os_precomposed_unicode == 1) && has_utf8(res->d_name, (size_t)-1, &namelen)) { Likewise. > + int ret_errno = errno; > + size_t outsz = sizeof(precomposed_utf8_dirp->dirent_nfc.d_name) - 1; /* one for \0 */ A few issues: - Why (-1)? Don't you need terminating NUL at the end of the output anyway? - sizeof(d_name) is most likely incorrect (Cf. http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/dirent.h.html) - Is the reason why it is appropriate to update "res" in-place because turning a decomposed name into precomposed form will always yield a shorter result? Is that guaranteed? > + /* Copy all data except the name */ > + memcpy(&precomposed_utf8_dirp->dirent_nfc, res, > + sizeof(precomposed_utf8_dirp->dirent_nfc)-sizeof(precomposed_utf8_dirp->dirent_nfc.d_name)); > + errno = 0; > + > + cnt = iconv(precomposed_utf8_dirp->ic_precompose, &cp, &insz, &outpos, &outsz); > + if (cnt < sizeof(precomposed_utf8_dirp->dirent_nfc.d_name) -1) { s/-1)/- 1)/; Can't this iconv() fail and return -1 here, which would be smaller than the size of the structure minus one? > + *outpos = 0; > + errno = ret_errno; > + return &precomposed_utf8_dirp->dirent_nfc; > + } > + errno = ret_errno; > + } > + return res; > +} > diff --git a/compat/precomposed_utf8.h b/compat/precomposed_utf8.h > new file mode 100644 > index 0000000..79e65e7 > --- /dev/null > +++ b/compat/precomposed_utf8.h > @@ -0,0 +1,30 @@ > +#ifndef __PRECOMPOSED_UNICODE_H__ > +#include <sys/stat.h> > +#include <sys/types.h> > +#include <dirent.h> > +#include <iconv.h> The same comments for #include applies here. > +typedef struct { > + iconv_t ic_precompose; > + DIR *dirp; > + struct dirent dirent_nfc; > +} PRECOMPOSED_UTF_DIR; Note that "struct dirent" can be defined with d_name[1]; aren't you risking the memory after this structure to be overwritten with a longer name? -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html