(With reference to Message-ID: 00E0B9AC-2A2E-4F95-9B35-F3F63EBC3CF3@gmail.com)

"Yeah, the problem is finding someone who needs the feature _and_ is
able/willing to implement it."

Well, I'd like to take a stab at it. I've been thinking about how to go
about it for a little over two weeks, and I'm pretty sure I've changed my
mind more times than there are days in that period.

Current idea foci:

* UTF-8 BOM can go to hell: I'm not going to go to extra effort to figure
out what to do with UTF-8, as it seems to be working already for the most
part. I am not planning on forcing anything to unnaturally generate a BOM
when converting to UTF-8. (It sure doesn't look like iconv() does! At
least not by default...) I would like to make sure that all UTF-8 I deal
with is valid--and it looks like that might have been done already.

* UTF-16/UTF-32 detection: Detect files containing VALID UTF-16/UTF-32.
Files that fail this test are treated as "not detected"--so they remain
whatever we currently detect them as (binary, more likely than not). My
current inclination is to implement this in a separate source file &
header that code such as convert.c can include. That keeps the changes
needed in convert.c smaller and the whole thing more sensibly contained.
(See the sketch at the end of this message.)

* The commit message is already UTF-8 and this shouldn't change, so I'm
not going to mess with it. However, the utilities that display commit
messages, and the diff lines as well, shouldn't be completely re-worked
if we don't need to. Therefore, I propose using UTF-8 for this (diff,
etc.) even if the in-content encoding is different. This does raise the
issue of invalid UTF-8 not converting to UTF-16/UTF-32 without loss, but
if we're generating invalid UTF-8 we have other problems to begin with.

* Storage of UTF-16/UTF-32 content: Do we want to treat this like all
other text (raw content), normalizing to LF line endings and such
(requires changes to how we handle LF/CRLF conversions)? Or should we
store it as UTF-8 and convert back to the source encoding on export from
the object database (metadata would be required to do this correctly)?
It seems to me as if there are a few things to hash out before getting
beyond the point of accurate Unicode [file] content text detection.

Comments, suggestions, and flames (so long as BBQ sauce is provided, I
suppose) welcome.

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
           -John Pescatore, SANS NewsBites Vol. 12 Num. 59
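P.S. To make the detection bullet a little more concrete, here is the
rough shape of what I'm imagining for the separate source file. None of
these names exist in git today--it's just a sketch of BOM-plus-validation
detection, and the UTF-32 branches still need real code point checks:

/*
 * unicode-detect.c (hypothetical) -- sketch only.
 * Anything that doesn't validate comes back UNICODE_NOT_DETECTED, so the
 * caller keeps treating it exactly as it does today (likely binary).
 */
#include <stddef.h>
#include <stdint.h>

enum unicode_kind {
	UNICODE_NOT_DETECTED = 0,
	UNICODE_UTF16_LE,
	UNICODE_UTF16_BE,
	UNICODE_UTF32_LE,
	UNICODE_UTF32_BE
};

/* Check that buf decodes as UTF-16 with no broken surrogate pairs. */
static int validate_utf16(const unsigned char *buf, size_t len, int be)
{
	size_t i;

	if (len % 2)
		return 0;
	for (i = 0; i < len; i += 2) {
		uint16_t u = be ? (buf[i] << 8) | buf[i + 1]
				: (buf[i + 1] << 8) | buf[i];
		if (u >= 0xD800 && u <= 0xDBFF) {
			/* high surrogate must be followed by a low one */
			uint16_t v;
			if (i + 3 >= len)
				return 0;
			v = be ? (buf[i + 2] << 8) | buf[i + 3]
			       : (buf[i + 3] << 8) | buf[i + 2];
			if (v < 0xDC00 || v > 0xDFFF)
				return 0;
			i += 2;
		} else if (u >= 0xDC00 && u <= 0xDFFF) {
			/* stray low surrogate */
			return 0;
		}
	}
	return 1;
}

enum unicode_kind detect_utf16_32(const unsigned char *buf, size_t len)
{
	/* UTF-32 BOMs first (an FF FE 00 00 prefix would otherwise
	 * look like a UTF-16 LE BOM followed by a NUL) */
	if (len >= 4 && !buf[0] && !buf[1] && buf[2] == 0xFE && buf[3] == 0xFF)
		return UNICODE_UTF32_BE;	/* TODO: validate code points */
	if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && !buf[2] && !buf[3])
		return UNICODE_UTF32_LE;	/* TODO: validate code points */
	/* then UTF-16 BOMs */
	if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
		return validate_utf16(buf + 2, len - 2, 1) ?
			UNICODE_UTF16_BE : UNICODE_NOT_DETECTED;
	if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
		return validate_utf16(buf + 2, len - 2, 0) ?
			UNICODE_UTF16_LE : UNICODE_NOT_DETECTED;
	return UNICODE_NOT_DETECTED;
}

The idea would be for convert.c to call detect_utf16_32() on the buffer
it already has in hand and fall back to the existing binary heuristics
whenever it gets UNICODE_NOT_DETECTED. Whether BOM-less UTF-16/UTF-32
should also be sniffed is one of the things I'd like to hash out.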