On Tue, 13 Feb 2007, Junio C Hamano wrote:
>
> It might be safe for some definition of safe, but it is very
> Asian unfriendly.
>
> I'd probably suggest replacing it with what GNU diff uses, which
> we stole and implemented in diff.c::mmfile_is_binary().

Well, the thing is, mmfile_is_binary() doesn't really have a big downside
if it's wrong one way or the other. In contrast, LF->CRLF conversion, if
wrong, actually corrupts binary files.

So I felt it was better to be really safe than sorry. It's *much* better
to miss some CRLF translation than to do too much of it.

That said, I'm sure it could be improved a lot. In particular, characters
in the range 0x00 - 0x1f are clearly "more binary" than the 0x7f+ range,
with the obvious exceptions (tab, cr, lf). 0x00 - which is the only one
mmfile_is_binary() uses - is arguably the "most binary" one, of course,
but it might be interesting to give different weights to the whole range.

In particular, especially for small files, the fact that there is no 0x00
byte in no way indicates that it's not "binary".

This whole issue is obviously one reason I'd like to involve the filename
itself, and make it use a ".gitattributes" file - exactly because that
allows you to be much more aggressive and more precise.

(0x00 may be one of the more _common_ characters in many binary files,
which makes it a good character to search for too, so I don't really have
any hugely strong opinions here. After all, the whole heuristic is off by
default anyway, so it's "really safe" ;^)

		Linus
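
For illustration only, here is a minimal sketch in C of what "different weights for the whole range" could look like. It is not git's code. The function names, the weights, and the threshold are all hypothetical and untuned. The assumptions: a 0x00 byte marks the buffer as binary outright, other control bytes below 0x20 (except tab, cr, lf) count heavily, and bytes at 0x7f and above count only lightly, so that text made up entirely of non-ASCII bytes is not flagged on that basis alone.

```c
#include <stddef.h>

/* Hypothetical weights, picked only to illustrate the idea. */
#define WEIGHT_CONTROL 16	/* 0x01-0x1f, except tab, lf, cr */
#define WEIGHT_HIGH     1	/* 0x7f and above */

/* Weight of one non-NUL byte: higher means "more binary". */
static int byte_weight(unsigned char c)
{
	if (c == '\t' || c == '\n' || c == '\r')
		return 0;
	if (c < 0x20)
		return WEIGHT_CONTROL;
	if (c >= 0x7f)
		return WEIGHT_HIGH;
	return 0;
}

/*
 * Return 1 if the buffer looks binary, 0 if it looks like text.
 * A NUL byte decides immediately; otherwise the summed weights
 * must exceed the buffer length (an assumed threshold).
 */
int looks_binary_weighted(const unsigned char *buf, size_t len)
{
	size_t score = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		if (buf[i] == 0x00)
			return 1;
		score += (size_t)byte_weight(buf[i]);
	}
	return score > len;
}
```

With these made-up numbers, a four-byte buffer holding a single 0x01 is treated as binary, while a buffer consisting only of bytes in the 0x80+ range is not. Any real threshold would have to err on the side of calling a file binary, since a missed translation is harmless and a wrong one corrupts data.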