> On 15 Dec 2017, at 10:58, Jeff King <peff@xxxxxxxx> wrote: > > On Mon, Dec 11, 2017 at 04:50:23PM +0100, lars.schneider@xxxxxxxxxxxx wrote: > >> From: Lars Schneider <larsxschneider@xxxxxxxxx> >> >> Git and its tools (e.g. git diff) expect all text files in UTF-8 >> encoding. Git will happily accept content in all other encodings, too, >> but it might not be able to process the text (e.g. viewing diffs or >> changing line endings). >> >> Add an attribute to tell Git what encoding the user has defined for a >> given file. If the content is added to the index, then Git converts the >> content to a canonical UTF-8 representation. On checkout Git will >> reverse the conversion. > > This made me wonder what happens when you have a file that _was_ utf16, > and then you later convert it to utf8 and add a .gitattributes entry. > > I tried this with your patch: > > git init repo > cd repo > > echo foo | iconv -t utf16 >file > git add file > git commit -m utf16 > > echo 'file encoding=utf16' >.gitattributes > touch file ;# make it stat-dirty > git commit -m convert > > git checkout HEAD^ > > That works OK, because we try to read the attributes from the > destination tree for a checkout. If you do this: > > echo 'file encoding=utf16' >.git/info/attributes > git checkout HEAD^ > > then we get: > > warning: failed to encode 'file' from utf-8 to utf16 > > At least it figured out that it couldn't convert the content. It's > slightly troubling that it would try in the first place, though; are > there encoding pairs where we might accidentally generate nonsense? At this point we interpret utf-16 content as utf-8 and try to convert it to utf-16. That of course fails because utf-16 content is no valid utf-8. How could we stop trying that? How could Git possibly know what kind of encoding is used (apart from our new hint in gitattributes)? I asked myself the question about nonsense encoding pairs, too. That was the reason I added the "X -> UTF-8 -> Y && X == Y" check on Git add. However, with ".git/info/attributes" you could circumvent that check and probably generate nonsense. > It _should_ be uncommon, I think, to have a repo-level attribute set > like that. It's very common for us to use whatever happens to be in the > checked-out .gitattributes for some attributes (e.g., when doing a diff > of an older commit), but I think for checkout-related ones it's not. So > I think it may generally work in practice. And certainly the line-ending > code would share any similar problems, but at least there the result is > less confusing than mojibake. > > Playing around, I also managed to do this: > > echo 'file encoding=utf16' >.gitattributes > echo foo >file > > # I did these with an old version of git that didn't > # support the new attribute, so they blindly added the utf8 content. > git add . > git commit -m convert > > git.compile checkout HEAD^ > > which yielded: > > fatal: encoding 'file' from utf16 to utf-8 and back is not the same > > Now obviously my situation is somewhat nonsense. I was trying to force > the in-repo representation to utf8, but ended up with a mismatched > working tree file. But what's somewhat troubling is that I couldn't > checkout _away_ from that state due to the die() in convert_to_git(). > Which is in turn just there as part of refresh_index(). > > And indeed, other commands hit the same problem: > > $ git.compile diff > fatal: encoding 'file' from utf16 to utf-8 and back is not the same > > $ git.compile checkout -f > fatal: encoding 'file' from utf16 to utf-8 and back is not the same > > It may make sense to die() during "git add ." (since we're actually > changing the index entry, and we don't want to put nonsense into a > tree). But I'm not sure it's the best thing for operations which just > want to read the content. For them, perhaps it would be more appropriate > to issue a warning and return the untouched content. Absolutely! Thanks for spotting this. I will try to run die() only on "git add" in v2. - Lars