(With reference to Message-ID: 00E0B9AC-2A2E-4F95-9B35-F3F63EBC3CF3@gmail.com)

"Yeah, the problem is finding someone who needs the feature _and_ is
able/willing to implement it."

Well, I'd like to take a stab at it. I've been thinking about how to go
about it for a little over two weeks, and I'm pretty sure I've changed my
mind more times than there are days in that period.

Current idea foci:

* UTF-8 BOM can go to hell: I'm not going to go to extra effort to figure
out what to do with UTF-8, as it seems to be working already for the most
part. I am not planning on forcing anything to unnaturally generate a BOM
when converting to UTF-8. (It sure doesn't look like iconv() does! At
least not by default...) I would like to make sure that all UTF-8 I deal
with is valid--and it looks like that might have been done already.

* UTF-16/UTF-32 detection: Detect files containing VALID UTF-16/UTF-32.
Files that fail this test are treated as "not detected"--so they remain
whatever we currently detect them as (binary, more likely than not). My
current inclination is to implement this in a separate source file &
header that code such as convert.c can include. That keeps the changes
needed in convert.c smaller and the whole thing more sensibly contained.
(See the sketch at the end of this message.)

* The commit message is already UTF-8 and this shouldn't change, so I'm
not going to mess with it. However, the utilities that display commit
messages, and the diff lines as well, shouldn't be completely re-worked
if we don't need to. Therefore, I propose using UTF-8 for this (diff,
etc.) even if the in-content encoding is different. This does raise the
issue of invalid UTF-8 not converting to UTF-16/UTF-32 without loss, but
if we're generating invalid UTF-8 we have other problems to begin with.

* Storage of UTF-16/UTF-32 content: Do we want to treat this like all
other text (raw content), normalizing to LF line endings and such
(requires changes to how we handle LF/CRLF conversions)? Or should we
store it as UTF-8 and convert back to the source encoding on export from
the object database (metadata would be required to do this correctly)?
It seems to me as if there are a few things to hash out before getting
beyond the point of accurate Unicode [file] content text detection.

Comments, suggestions, and flames (so long as BBQ sauce is provided, I
suppose) welcome.

--
-Drew Northup
________________________________________________
"As opposed to vegetable or mineral error?"
           -John Pescatore, SANS NewsBites Vol. 12 Num. 59
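P.S. To make the detection bullet a little more concrete, here is the
rough shape of what I'm imagining for the separate source file. None of
these names exist in git today--it's just a sketch of BOM-plus-validation
detection, and the UTF-32 branches still need real code point checks:

/*
 * unicode-detect.c (hypothetical) -- sketch only.
 * Anything that doesn't validate comes back UNICODE_NOT_DETECTED, so the
 * caller keeps treating it exactly as it does today (likely binary).
 */
#include <stddef.h>
#include <stdint.h>

enum unicode_kind {
	UNICODE_NOT_DETECTED = 0,
	UNICODE_UTF16_LE,
	UNICODE_UTF16_BE,
	UNICODE_UTF32_LE,
	UNICODE_UTF32_BE
};

/* Check that buf decodes as UTF-16 with no broken surrogate pairs. */
static int validate_utf16(const unsigned char *buf, size_t len, int be)
{
	size_t i;

	if (len % 2)
		return 0;
	for (i = 0; i < len; i += 2) {
		uint16_t u = be ? (buf[i] << 8) | buf[i + 1]
				: (buf[i + 1] << 8) | buf[i];
		if (u >= 0xD800 && u <= 0xDBFF) {
			/* high surrogate must be followed by a low one */
			uint16_t v;
			if (i + 3 >= len)
				return 0;
			v = be ? (buf[i + 2] << 8) | buf[i + 3]
			       : (buf[i + 3] << 8) | buf[i + 2];
			if (v < 0xDC00 || v > 0xDFFF)
				return 0;
			i += 2;
		} else if (u >= 0xDC00 && u <= 0xDFFF) {
			/* stray low surrogate */
			return 0;
		}
	}
	return 1;
}

enum unicode_kind detect_utf16_32(const unsigned char *buf, size_t len)
{
	/* UTF-32 BOMs first (an FF FE 00 00 prefix would otherwise
	 * look like a UTF-16 LE BOM followed by a NUL) */
	if (len >= 4 && !buf[0] && !buf[1] && buf[2] == 0xFE && buf[3] == 0xFF)
		return UNICODE_UTF32_BE;	/* TODO: validate code points */
	if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE && !buf[2] && !buf[3])
		return UNICODE_UTF32_LE;	/* TODO: validate code points */
	/* then UTF-16 BOMs */
	if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
		return validate_utf16(buf + 2, len - 2, 1) ?
			UNICODE_UTF16_BE : UNICODE_NOT_DETECTED;
	if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
		return validate_utf16(buf + 2, len - 2, 0) ?
			UNICODE_UTF16_LE : UNICODE_NOT_DETECTED;
	return UNICODE_NOT_DETECTED;
}

The idea would be for convert.c to call detect_utf16_32() on the buffer
it already has in hand and fall back to the existing binary heuristics
whenever it gets UNICODE_NOT_DETECTED. Whether BOM-less UTF-16/UTF-32
should also be sniffed is one of the things I'd like to hash out.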