Re: [PATCH] Use FIX_UTF8_MAC to enable conversion from UTF8-MAC to UTF8

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Mon, 21 Jan 2008 20:59:56 -0800 (PST)

On Mon, 21 Jan 2008, Junio C Hamano wrote:
> 
> yet you can obtain a path component longer than 256 bytes.

Individual components are limited to 255 bytes by most filesystems 
(PATH_MAX is the whole path, not any individual component).

That said, you're right. It's not really a design requirement, and since 
you never get an array of "struct dirent", just a pointer to a single one, 
it would be perfectly normal and natural for "struct dirent" to be 
declared with a unsized d_name[].

It's also quite possible that some implementations might even have 
d_name[] not as an array, but as a pointer to somewhere else (POSIX may 
require it to be an array, I didn't check).

That said, I bet that Mark isn't the only one to have written code like 
that, so I suspect Mark's code probably works in practice pretty much 
everywhere, even if I don't think it's necessarily _required_ to work 
correctly.

I do suspect that if you really want to make this portable, and able to 
handle an expanding d_name[] too, I think you need to make sure you 
allocate a big-enough one. And if you worry about d_name perhaps being a 
pointer, that really does mean that you'd need to convert the 
system-supplied "struct dirent" into a "git_dirent_t" that you can 
control.

That said, I think this patch has a bigger problem, namely just 
fundamentally that

	char *utf8 = reencode_string(entry, "UTF8", "UTF8-MAC");

is just unbelievably slow. That's just not how it should be done.

First off, the common case is that the filename likely has everything in 
plain 7-bit ascii. So rather than re-encoding by default, the first thing 
to do is to just see if it even needs re-encoding. Even if it's as simple 
as saying "does it have any high bits at all", that's going to be a *huge* 
performance win.

So start off with something like

	int is_usascii(const char *p)
	{
		char c;

		do {
			c = *p++;
		} while (c > 0);
		return !c;
	}

and now you can do

	if (is_usascii(entry->d_name))
		return entry;

before you even *look* at re-encoding it (and this basically works for all 
cases - we really don't care about EBCDIC, do we? So even if this routine 
was meant to do Latin1<->utf8, the above "is_usascii()" test is always the 
right thing to do).

Anyway, even if you do that, our "reencode_string()" is really *so* 
expensive that you really don't want to do it on a filename by filename 
basis. It literally does a malloc() for each allocation. It might well be 
worth it to find something that is more utf-8-specific (and I could well 
imagine that Mac OS X comes with some UTF libraries, if only because we 
cannot possibly be the only people with this issue).

(Same goes for Latin1<->UTF conversion, for that matter. If somebody wants 
to add that, I suspect it's best done by hand, not using iconv and our 
rather expensive layer around it. That said, latin1->utf8 is actually 
much *easier* than utf-8 NFD->NFC).

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html