Re: Cross-Platform Version Control

On Wed, 13 May 2009, Linus Torvalds wrote:
> 
> utf-8 normalization was one goal, and shouldn't be _that_ hard to do. But 
> quite frankly, the index is only part of it, and probably not the worst 
> part.
> 
> The real pain of filename handling is all the "read tree recursively with 
> readdir()" issues. Along with just an absolute sh*t-load of issues about 
> what to do when people ended up using different versions of the "same" 
> name in different branches.

Btw, if people care mainly just about OS X, and don't worry so much about 
case, but about the idiotic and insane OS X behavior of turning UTF-8 
filenames into that crazy NFD format, here's a simple patch that may be 
useful for that.

There _will_ certainly be other places, but this handles the one big case 
of "read_directory_recursive()", and can turn NFD into the sane NFC 
format.

Since OS X will then accept NFC (and internally turn it back to NFD) when 
you pass them as filenames, that means that converting the other way is 
not necessary.

NOTE NOTE NOTE! This really just handles one case, and is not enough for 
any kind of general case. For example, it does NOT handle the case where 
you do

	git add filename_with_åäö

explicitly, because if the "filename_with_åäö" is done using NFD 
(tab-completion etc), now git won't _match_ it with the filename it reads 
using readdir() any more (which got converted to NFC), so at a minimum 
we'd need to do that crazy NFD->NFC conversion in all the pathspecs too. 

See "get_pathspec()" in setup.c for that latter case.
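
To make the mismatch concrete, here is a minimal sketch of why a byte-wise compare fails. The two arrays spell the same visible name, once with a precomposed "å" (NFC) and once with "a" plus a combining ring (NFD). The helper name pathspec_matches() is made up for illustration and is not git's actual matching code.

```c
#include <string.h>

/* "å" precomposed (NFC): U+00E5, encoded in UTF-8 as C3 A5. */
static const char name_nfc[] = "filename_with_\xc3\xa5";

/* "å" decomposed (NFD): "a" + U+030A, encoded in UTF-8 as 61 CC 8A. */
static const char name_nfd[] = "filename_with_a\xcc\x8a";

/*
 * Hypothetical stand-in for a byte-wise pathspec match:
 * the names only match if every byte is identical.
 */
static int pathspec_matches(const char *pathspec, const char *name)
{
	return strcmp(pathspec, name) == 0;
}
```

Calling pathspec_matches(name_nfd, name_nfc) returns 0 even though both strings render identically, which is why the pathspecs would need the same conversion as the readdir() output.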

But with that, and this crazy thing, OS X users might be already a lot 
better off. Totally untested, of course. 

Oh, and somebody needs to fill in that 

	convert_name_from_nfd_to_nfc()

implementation. It's designed so that if it notices that the string is 
just plain US-ASCII, it can return 0 and no extra work is done. That, in 
turn, can easily be done by some simple and efficient pre-processing that 
checks that there are no high bits set (on a 64-bit platform, do it 8 
characters at a time with a "& 0x8080808080808080"), so that the common 
case has barely any overhead at all.
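
A rough sketch of that high-bit pre-check, assuming a helper called name_is_ascii() (the name is invented here, it is not in the patch). It tests eight bytes per step against the 0x8080808080808080 mask and finishes the tail one byte at a time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): return 1 if the
 * first 'len' bytes of 'name' are all plain US-ASCII, 0 otherwise.
 */
static int name_is_ascii(const char *name, size_t len)
{
	const uint64_t high_bits = 0x8080808080808080ULL;
	size_t i = 0;

	/* Eight bytes per step; memcpy() sidesteps unaligned loads. */
	for (; i + 8 <= len; i += 8) {
		uint64_t word;

		memcpy(&word, name + i, sizeof(word));
		if (word & high_bits)
			return 0;
	}

	/* Leftover tail (fewer than eight bytes), one byte at a time. */
	for (; i < len; i++)
		if ((unsigned char)name[i] & 0x80)
			return 0;

	return 1;
}
```

The conversion function could call this first and return 0 right away when it reports 1, so pure-ASCII names never reach the normalizer.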

Use <stringprep.h> and stringprep_utf8_nfkc_normalize() or something to do 
the actual normalization if you find characters with the high bit set 
(careful: that one does NFKC, which also folds compatibility characters, 
so a plain NFC routine would be the closer fit). And since I know that 
the OS X filesystems are so buggy as to not even do that whole NFD thing 
right, there is probably some OS-X specific "use this for filesystem 
names" conversion function.
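
Purely to illustrate the calling convention, here is a toy version of the conversion function. The signature is guessed from the call site in the patch, and the composition table is a hand-picked handful of Latin letters, nowhere near real Unicode normalization. A real implementation would hand the work to a proper normalization library.

```c
#include <stddef.h>

/* Toy table: base letter + combining mark -> precomposed letter. */
struct compose_rule {
	char base;
	unsigned char mark;	/* 2nd UTF-8 byte of the mark; the 1st is 0xCC */
	unsigned char composed;	/* code point in U+00C0..U+00FF */
};

static const struct compose_rule rules[] = {
	{ 'a', 0x8a, 0xe5 },	/* a + U+030A (ring)      -> å */
	{ 'A', 0x8a, 0xc5 },	/* A + U+030A (ring)      -> Å */
	{ 'a', 0x88, 0xe4 },	/* a + U+0308 (diaeresis) -> ä */
	{ 'A', 0x88, 0xc4 },	/* A + U+0308 (diaeresis) -> Ä */
	{ 'o', 0x88, 0xf6 },	/* o + U+0308 (diaeresis) -> ö */
	{ 'O', 0x88, 0xd6 },	/* O + U+0308 (diaeresis) -> Ö */
};

/*
 * Assumed contract: write the converted, NUL-terminated name to 'out'
 * and return its length, or return 0 if nothing changed (or if 'out'
 * is too small), so that the caller keeps using the original name.
 */
static int convert_name_from_nfd_to_nfc(const char *in, int len,
					char *out, int outsize)
{
	int i = 0, n = 0, changed = 0;

	while (i < len) {
		size_t r = 0;
		int hit = 0;

		/* Is this a base letter followed by a 0xCC-prefixed mark? */
		if (i + 2 < len && (unsigned char)in[i + 1] == 0xcc) {
			for (r = 0; r < sizeof(rules) / sizeof(rules[0]); r++) {
				if (in[i] == rules[r].base &&
				    (unsigned char)in[i + 2] == rules[r].mark) {
					hit = 1;
					break;
				}
			}
		}

		/* Keep room for up to two bytes plus the final NUL. */
		if (n + 2 >= outsize)
			return 0;

		if (hit) {
			/* U+00C0..U+00FF encodes as C3 followed by 80..BF. */
			out[n++] = (char)0xc3;
			out[n++] = (char)(0x80 | (rules[r].composed & 0x3f));
			i += 3;
			changed = 1;
		} else {
			out[n++] = in[i++];
		}
	}
	out[n] = '\0';
	return changed ? n : 0;
}
```

Returning 0 both for "nothing to do" and for "does not fit" keeps the caller trivial: it only swaps in the temporary buffer when the return value is non-zero.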

Hmm. Anybody want to take this on? It really shouldn't be too complex to 
get it working for the common case on just OS X. It's really the case 
sensitivity that is the biggest problem. If you ignore that for now, the 
problem space is _much_ smaller.

In other words, I think we can reasonably easily support a subset of 
_common_ issues with some trivial patches like this. But getting it right 
in _all_ the cases is going to be much more work (there are lots of other 
uses of "readdir()" too, this one just happens to be one of the more 
central ones).

Of course, it probably makes sense to have a whole "git_readdir()" that 
does this thing in general. That "create_full_path()" thing makes sense 
regardless, though, in that it also simplifies a lot of "baselen+len" 
usage into just "len".

		Linus

---
 dir.c |   40 ++++++++++++++++++++++++++++++++--------
 1 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/dir.c b/dir.c
index 6aae09a..4cbfc24 100644
--- a/dir.c
+++ b/dir.c
@@ -566,6 +566,30 @@ static int get_dtype(struct dirent *de, const char *path)
 }
 
 /*
+ * Take the readdir output, in (d_name,len), and append it to
+ * our base name in (fullname,baselen) with any required
+ * readdir fs->internal translation.
+ *
+ * Put the result in 'fullname', and return the final length.
+ *
+ * Right now we have no translation, and just do a memcpy()
+ * (the +1 is to copy the final NUL character too).
+ */
+static int create_full_path(char *fullname, int baselen, const char *d_name, int len)
+{
+#ifdef OS_X_IS_SOME_CRAZY_SHxAT
+	char temp[256];
+	int nlen = convert_name_from_nfd_to_nfc(d_name, len, temp, sizeof(temp));
+	if (nlen) {
+		len = nlen;
+		d_name = temp;
+	}
+#endif
+	memcpy(fullname + baselen, d_name, len + 1);
+	return baselen + len;
+}
+
+/*
  * Read a directory tree. We currently ignore anything but
  * directories, regular files and symlinks. That's because git
  * doesn't handle them at all yet. Maybe that will change some
@@ -595,15 +619,15 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			/* Ignore overly long pathnames! */
 			if (len + baselen + 8 > sizeof(fullname))
 				continue;
-			memcpy(fullname + baselen, de->d_name, len+1);
-			if (simplify_away(fullname, baselen + len, simplify))
+			len = create_full_path(fullname, baselen, de->d_name, len);
+			if (simplify_away(fullname, len, simplify))
 				continue;
 
 			dtype = DTYPE(de);
 			exclude = excluded(dir, fullname, &dtype);
 			if (exclude && (dir->flags & DIR_COLLECT_IGNORED)
-			    && in_pathspec(fullname, baselen + len, simplify))
-				dir_add_ignored(dir, fullname, baselen + len);
+			    && in_pathspec(fullname, len, simplify))
+				dir_add_ignored(dir, fullname, len);
 
 			/*
 			 * Excluded? If we don't explicitly want to show
@@ -630,9 +654,9 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			default:
 				continue;
 			case DT_DIR:
-				memcpy(fullname + baselen + len, "/", 2);
+				memcpy(fullname + len, "/", 2);
 				len++;
-				switch (treat_directory(dir, fullname, baselen + len, simplify)) {
+				switch (treat_directory(dir, fullname, len, simplify)) {
 				case show_directory:
 					if (exclude != !!(dir->flags
 							& DIR_SHOW_IGNORED))
@@ -640,7 +664,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 					break;
 				case recurse_into_directory:
 					contents += read_directory_recursive(dir,
-						fullname, fullname, baselen + len, 0, simplify);
+						fullname, fullname, len, 0, simplify);
 					continue;
 				case ignore_directory:
 					continue;
@@ -654,7 +678,7 @@ static int read_directory_recursive(struct dir_struct *dir, const char *path, co
 			if (check_only)
 				goto exit_early;
 			else
-				dir_add_name(dir, fullname, baselen + len);
+				dir_add_name(dir, fullname, len);
 		}
 exit_early:
 		closedir(fdir);