Re: [RFD] Handling of non-UTF8 data in gitweb

On Sun, Dec 04, 2011 at 05:09:30PM +0100, Jakub Narebski wrote:

> The correct solution would be of course to respect `gui.encoding`
> per-repository config variable, and `encoding` gitattribute...
> though the latter is hampered by the fact that there is currently
> no way to read attribute with "git check-attr" from a given tree:
> think of a diff of change of encoding of a file!

We deal with the same problem at GitHub.

There really isn't a good way to specify per-file encodings. Something
like gui.encoding is too coarse. As you mentioned, we don't do per-tree
gitattribute lookups, so the encoding attribute has problems when the
encoding of a file changes. But even if we implemented them, you still
have the problem of being handed a raw sha1 (e.g., git diff 9624865 e0a3260).
A bare blob id carries no path or tree, so there's no way to look up
attributes for it.

It would be nice if you could put an "encoding" header into the blob
object. You could use the .gitattributes in place at "git add" time to
set it. And then at lookup time, you either have the encoding, or you
assume it's in utf8 (if it isn't binary, of course).

But there's no room in the blob format for headers; the content starts
right after the size header.
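
To make the "no room" point concrete, here is a minimal Python sketch of how a blob object is laid out and hashed in a SHA-1 repository. The function names are made up for illustration; the layout itself ("blob", a space, the decimal size, one NUL byte, then the raw content) is the standard loose-object format before zlib compression.

```python
import hashlib


def blob_object(content: bytes) -> bytes:
    """Return the uncompressed bytes git hashes for a blob."""
    # The entire header is "blob <decimal size>" followed by one NUL byte.
    # Nothing else can be placed between the header and the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return header + content


def blob_id(content: bytes) -> str:
    """SHA-1 object id, as `git hash-object` reports for the same bytes."""
    return hashlib.sha1(blob_object(content)).hexdigest()


if __name__ == "__main__":
    print(blob_object(b"hello\n"))  # b'blob 6\x00hello\n'
    print(blob_id(b"hello\n"))      # ce013625030ba8dba906f756967f9e9ca394464a
```

Because the id is a hash over exactly these bytes, squeezing an encoding field into the object would change the id of every blob that carried one.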

You can get around this by searching the history for a tree that
contains the blob, and then checking the gitattributes. It's expensive,
but you could build a cache over time. However, it's not guaranteed to
provide a single answer; you could have multiple trees that mention the
blob, each with different attributes.
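
The ambiguity can be shown with a toy model. The sketch below does not call git at all: the history is a hypothetical in-memory list of trees, the blob ids are just placeholders, and attribute matching is simplified to fnmatch with last-match-wins rather than git's real pattern rules.

```python
from fnmatch import fnmatch

# Hypothetical history: each tree maps path -> blob id and carries the
# (pattern, encoding) rules from its own .gitattributes.
TREES = [
    {"files": {"doc/readme.txt": "9624865"},
     "attrs": [("*.txt", "iso-8859-1")]},
    {"files": {"doc/readme.txt": "e0a3260", "old/readme.txt": "9624865"},
     "attrs": [("*.txt", "utf-8"), ("old/*", "iso-8859-1")]},
    {"files": {"copy.txt": "9624865"},
     "attrs": [("*.txt", "utf-8")]},
]


def encoding_for(path, attrs):
    """Return the encoding of the last matching rule, or None."""
    result = None
    for pattern, encoding in attrs:
        if fnmatch(path, pattern):
            result = encoding
    return result


def candidate_encodings(blob, trees):
    """Collect every encoding under which some tree mentions the blob."""
    found = set()
    for tree in trees:
        for path, blob_id in tree["files"].items():
            if blob_id == blob:
                found.add(encoding_for(path, tree["attrs"]))
    return found


if __name__ == "__main__":
    print(sorted(candidate_encodings("9624865", TREES)))
    print(sorted(candidate_encodings("e0a3260", TREES)))
```

Here the first blob comes back with two candidate encodings and the second with one. A cache keyed by blob id would only speed up the walk; it would still have to store a set rather than a single value.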

And even if you implement all that, you still have the problem that older
blobs won't have gotten an encoding header, even if they would have under
the new rules. So rather than assuming utf8 for them, you have to make a
guess anyway.

At GitHub, we talked about a lot of these options and ended up just
using an encoding-detection library to make a best guess. It seems to
work well in practice, but it's only been deployed for a couple of
months.
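
For illustration only, a best guess can be as crude as the Python sketch below. This is not the library deployed at GitHub, and a real detector relies on statistical models rather than trial decoding. The candidate list, the 8000-byte window, and the function names are all assumptions made for this example.

```python
# Assumption: try strict UTF-8 first, then fall back to a single legacy
# encoding. ISO-8859-1 accepts any byte sequence, so it always "succeeds".
CANDIDATES = ("utf-8", "iso-8859-1")


def looks_binary(data: bytes) -> bool:
    """Treat content with a NUL byte near the start as binary."""
    return b"\0" in data[:8000]


def guess_encoding(data: bytes):
    """Return the first candidate that decodes cleanly, or None for binary."""
    if looks_binary(data):
        return None
    for encoding in CANDIDATES:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


if __name__ == "__main__":
    print(guess_encoding("café".encode("utf-8")))       # utf-8
    print(guess_encoding("café".encode("iso-8859-1")))  # iso-8859-1
    print(guess_encoding(b"\x00\x01\x02"))              # None
```

Trial decoding like this cannot tell one single-byte legacy encoding from another, which is where a proper detection library earns its keep.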

-Peff