On Tue, 2011-12-13 at 12:59 -0500, Jeff King wrote:
> It looks like we already have a check for is_utf8, and this is not
> failing that check. I guess because is_utf8 takes a NUL-terminated
> buffer, so it simply sees the truncated result (i.e., depending on
> endianness, "foo" in utf16 is something like "f\0o\0o\0", so we check
> only "f"). We could make is_utf8 take a length parameter to be more
> accurate, and then it would catch this.
>
> However, I think that's not quite what we want. We only check is_utf8 if
> the encoding field is not set. And really, we want to reject NULs no
> matter _which_ encoding they've set, because git simply doesn't handle
> them properly.

A while back I had already started experimenting with automatic
detection of likely UTF-16, so that compatible platforms could handle it
appropriately when creating diffs and when dealing with newline munging
between platforms. There is no 100% sure-fire check for UTF-16 if you
don't already suspect the data might be UTF-16. If we really want to
check for possible UTF-16 specifically, I can dig out the check I wrote
and send it along.

The is_utf8 check was not written to detect 100% valid UTF-8 per se. It
seems to me that it was written as part of the "is this binary or not"
check in the add/commit path. I have thought for some time that passing
an explicit buffer length through that whole code path would be a good
idea (though I assumed somebody else had taken up that battle while I
was busy dealing with other problems elsewhere); if nothing else, it
would force the code to deal with NULs more intelligently.

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
-John Pescatore, SANS NewsBites Vol. 12 Num. 59