On Sat, Feb 10, 2018 at 05:21:05PM +0700, Duy Nguyen wrote:

> > But it doesn't seem to work:
> >
> >   $ yes | head -c $((1024*1024*1024 - 10*1024*1024)) >file
> >   $ git add file
> >   $ git commit -m one
> >   $ yes | head -c $((1024*1024*1024)) >file
> >   $ git commit -am two
> >   fatal: unable to generate diffstat for file
> >
> > What's weird is that if I run "git show --stat" on the same commit, it
> > works! So there's something about how commit invokes the diff that
> > doesn't let the big-file check kick in.
>
> I have a flashback about checking big_file_threshold in this code. I
> vaguely remember doing something in this code but not sure if it was
> merged.
>
> I finally found 6bf3b81348 (diff --stat: mark any file larger than
> core.bigfilethreshold binary - 2014-08-16) so it's merged after all,
> but this commit is about --stat apparently ;-)

The confounding factor here is actually the break-detection that
"commit" does. After running the "commit" above (which does succeed
despite the "fatal", since that happens only as part of the diff summary
we print), you can replicate the problem with "git show -B --stat".

The break-detection code unconditionally loads the file data, and when
the stat code later checks whether the file is binary, it just uses that
data as-is. So that leads me to a few questions / lines of thought:

  1. When we're checking binary-ness, we shouldn't assume that just
     because the data is loaded, its size is sane. We should check it
     against big_file_threshold.

  2. Should break detection (and probably inexact rename detection) skip
     files that are over big_file_threshold? I think our code is capable
     of actually performing these operations on large files (at least on
     systems with a 64-bit ulong; I'd be willing to bet you get funny
     results due to integer overflow on 32-bit systems or on 64-bit
     Windows). But I'm not sure it's doing anybody any good. And it's
     way faster not to.
For example, here's "git show" before and after the patch below (running
on the repo from my example):

  $ time git show --oneline -B --stat | cat
  fatal: unable to generate diffstat for file
  883cbdc two

  real    0m10.153s
  user    0m9.929s
  sys     0m0.224s

  $ time git.compile show --oneline -B --stat | cat
  883cbdc two
   file | Bin 1063256064 -> 1073741824 bytes
   1 file changed, 0 insertions(+), 0 deletions(-)

  real    0m0.008s
  user    0m0.002s
  sys     0m0.010s

We can produce the useful answer in a fraction of the time, since we
don't even need to load the blob content. The downside is that in theory
we could break-detect these files, and then find renames. So I guess it
comes down to the philosophy of core.bigfilethreshold: how much effort
are we willing to put into such files on the off chance that they
produce a useful output?

If we go this route, then in theory the fix in (1) becomes redundant,
though I'd probably do both just as a belt-and-suspenders thing.

  3. To what degree should we override the user's config to treat these
     files as binary? If I set core.bigfilethreshold to "10G", or if I
     use gitattributes to mark the file as non-binary, then we're still
     going to feed it to xdiff (which is going to choke and die). Should
     we override these when we approach MAX_DIFF_SIZE, and just treat
     the file as binary? Should we barf and tell the user to fix their
     config?

     I guess I argued for overriding attributes earlier, and that
     probably ought to be independent of core.bigfilethreshold. Perhaps
     we should issue a warning in that case, to let the user know their
     config is nonsense.

> > I think we probably ought to consider anything over big_file_threshold
> > to be binary, no matter what. Possibly even if the user gave us a
> > .gitattribute that says "no really, this is text". Because that 1GB
> > limit is a hard limit that the code can't cope with; our options are
> > either to generate a binary diff or to die.
>
> Agreed.
> While at there perhaps we need to improve this "unable to
> generate diffstat" message a bit too.

One reason the message is so vague is that we don't actually have any
kind of error code. Though really the only reason for xdiff to fail is
file size, so we could perhaps offer some advice there.

But if we do all the things I suggested above, then we'd automatically
handle all the cases we know about, so my hope was that we would make it
impossible to trigger this message. In which case it maybe ought to be a
BUG(). :)

Here's the patch to make "show -B --stat" work. I'll give some more
thought to whether this is a good idea and prepare a series. One
downside is that in the common case it causes us to look up each object
twice (once to get its size, and then again to load the content). I
wonder if we should have a function for "read this object, unless it's
over N bytes, in which case just tell me the size". That's weirdly
specific, but I think pretty much every user of core.bigfilethreshold
would want it.

---
diff --git a/diffcore-break.c b/diffcore-break.c
index c64359f489..35f5b50bcc 100644
--- a/diffcore-break.c
+++ b/diffcore-break.c
@@ -61,6 +61,13 @@ static int should_break(struct diff_filespec *src,
 	    !oidcmp(&src->oid, &dst->oid))
 		return 0; /* they are the same */
 
+	if (diff_populate_filespec(src, CHECK_SIZE_ONLY) ||
+	    diff_populate_filespec(dst, CHECK_SIZE_ONLY))
+		return 0; /* error but caught downstream */
+
+	if (src->size > big_file_threshold || dst->size > big_file_threshold)
+		return 0; /* too big to be worth computation */
+
 	if (diff_populate_filespec(src, 0) || diff_populate_filespec(dst, 0))
 		return 0; /* error but caught downstream */