On Thu, Feb 23, 2012 at 07:17:49PM -0800, Junio C Hamano wrote: > > Thoughts? Was this intentional, or just overlooked? > > This is intentional in the sense it is not worth worrying about (I > personally consider large and binary gunk equivalent and something nobody > should care about anything more than 1. what the exact contents it had at > each point in history and 2. if it did or did not change between two > versions, but definitely not 3. how different these two versions were), > and does not deserve the complexity to support filtering and textconv'ing. We're purely in the hypothetical devil's advocate realm here, as I do not use the filtering feature myself. But what if I had a filter that canonicalized large binary files in some way (e.g., by re-zipping large zipfiles using deterministic parameters). Even though they are large binary gunk and you would never want to diff two versions, the important thing is getting an identical sha1 when the underlying data has not changed. I.e., your point 2 above, but applied to the "clean" repository versions of files. > If somebody strongly feels about lifting the limitation with a clean patch > that does not harm the common case codepaths, patches are welcome, but I > am not planning to do it myself ;-) Perhaps something like this: --- convert.c | 18 ++++++++++++++++++ convert.h | 1 + sha1_file.c | 3 ++- 3 files changed, 21 insertions(+), 1 deletion(-) diff --git a/convert.c b/convert.c index 12868ed..254301a 100644 --- a/convert.c +++ b/convert.c @@ -742,6 +742,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path) } } +int want_convert_to_git(const char *path) +{ + struct conv_attrs ca; + convert_attrs(&ca, path); + + if (ca.drv && ca.drv->clean) + return 1; + if (ca.ident) + return 1; + + ca.crlf_action = input_crlf_action(ca.crlf_action, ca.eol_attr); + if (ca.crlf_action == CRLF_BINARY) + return 0; + if (ca.crlf_action == CRLF_GUESS && auto_crlf == AUTO_CRLF_FALSE) + return 0; + return 1; +} + int convert_to_git(const char *path, const char *src, size_t len, struct strbuf *dst, enum safe_crlf checksafe) { diff --git a/convert.h b/convert.h index d799a16..4d5936d 100644 --- a/convert.h +++ b/convert.h @@ -36,6 +36,7 @@ extern enum eol core_eol; /* returns 1 if *dst was used */ extern int convert_to_git(const char *path, const char *src, size_t len, struct strbuf *dst, enum safe_crlf checksafe); +extern int want_convert_to_git(const char *path); extern int convert_to_working_tree(const char *path, const char *src, size_t len, struct strbuf *dst); extern int renormalize_buffer(const char *path, const char *src, size_t len, diff --git a/sha1_file.c b/sha1_file.c index f9f8d5e..6c0e05c 100644 --- a/sha1_file.c +++ b/sha1_file.c @@ -2720,7 +2720,8 @@ int index_fd(unsigned char *sha1, int fd, struct stat *st, if (!S_ISREG(st->st_mode)) ret = index_pipe(sha1, fd, type, path, flags); - else if (size <= big_file_threshold || type != OBJ_BLOB) + else if (size <= big_file_threshold || type != OBJ_BLOB || + want_convert_to_git(path)) ret = index_core(sha1, fd, size, type, path, flags); else ret = index_stream(sha1, fd, size, type, path, flags); There should be no performance impact, as the new code only kicks in for files that exceed big_file_threshold (and even then, it is just an extra duplicate attr lookup, which the check_attr code caches anyway). I don't like repeating all of the convert_to_git policy logic. Perhaps if you pass a NULL buffer to convert_to_git, it should run through its usual logic, stopping just short of actually writing anything, and return a flag indicating whether it _would_ convert (this can't be 100% accurate, as sometimes conversion depends on looking at the actual contents of the buffer, but it could at least tell us "yes, I might convert" versus "no, I will definitely not convert"). -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html