On Mon, Jun 1, 2020 at 4:22 PM Jeff King <peff@xxxxxxxx> wrote:
> Subject: [PATCH] diff: discard blob data from stat-unmatched pairs
>
> When performing a tree-level diff against the working tree, we may find
> that our index stat information is dirty, so we queue a filepair to be
> examined later. If the actual content hasn't changed, we call this a
> stat-unmatch; the stat information was out of date, but there's no
> actual diff. Normally diffcore_std() would detect and remove these
> identical filepairs via diffcore_skip_stat_unmatch(). However, when
> "--quiet" is used, we want to stop the diff as soon as we see any
> changes, so we check for stat-unmatches immediately in diff_change().
>
> That check may require us to actually load the file contents into the
> pair of diff_filespecs. If we find that the pair isn't a stat-unmatch,
> then no big deal; we'd likely load the contents later anyway to generate
> a patch, do rename detection, etc, so we want to hold on to it. But if
> it a stat-unmatch, then we have no more use for that data; the whole

s/it/& is/

> point is that we're going discard the pair. However, we never free the
> allocated diff_filespec data.
>
> In most cases, keeping that data isn't a problem. We don't expect a lot
> of stat-unmatch entries, and since we're using --quiet, we'd quit as
> soon as we saw such a real change anyway. However, there are extreme
> cases where it makes a big difference:
>
>   1. We'd generally mmap() the working tree half of the pair. And since
>      the OS may limit the total number of maps, we can run afoul of this
>      in large repositories. E.g.:
>
>        $ cd linux
>        $ git ls-files | wc -l
>        67959
>        $ sysctl vm.max_map_count
>        vm.max_map_count = 65530
>        $ git ls-files | xargs touch ;# everything is stat-dirty!
>        $ git diff --quiet
>        fatal: mmap failed: Cannot allocate memory
>
>      It should be unusual to have so many files stat-dirty, but it's
>      possible if you've just run a script like "sed -i" or similar.
>
>      After this patch, the above correctly exits with code 0.
>
>   2. Even if you don't hit mmap limits, the index half of the pair will
>      have been pulled from the object database into heap memory. Again
>      in a clone of linux.git, running:
>
>        $ git ls-files | head -n 10000 | xargs touch
>        $ git diff --quiet
>
>      peaks at 145MB heap before this patch, and 94MB after.
>
> This patch solves the problem by freeing any diff_filespec data we
> picked up during the "--quiet" stat-unmatch check in diff_changes.
> Nobody is going to need that data later, so there's no point holding on
> to it. There are a few things to note:
>
>   - we could skip queueing the pair entirely, which could in theory save
>     a little work. But there's not much to save, as we need a
>     diff_filepair to feed to diff_filespec_check_stat_unmatch() anyway.
>     And since we cache the result of the stat-unmatch checks, a later
>     call to diffcore_skip_stat_unmatch() call will quickly skip over
>     them. The diffcore code also counts up the number of stat-unmatched
>     pairs as it removes them. It's doubtful any callers would care about
>     that in combination with --quiet, but we'd have to reimplement the
>     logic here to be on the safe side. So it's not really worth the
>     trouble.
>
>   - I didn't write a test, because we always produce the correct output
>     unless we run up against system mmap limits, which are both
>     unportable and expensive to test against. Measuring peak heap
>     would be interesting, but our perf suite isn't yet capable of that.
>
>   - note that diff without "--quiet" does not suffer from the same
>     problem. In diffcore_skip_stat_unmatch(), we detect the stat-unmatch
>     entries and drop them immediately, so we're not carrying their data
>     around.
>
>   - you _can_ still trigger the mmap limit problem if you truly have
>     that many files with actual changes. But it's rather unlikely. The
>     stat-unmatch check avoids loading the file contents if the sizes
>     don't match, so you'd need a pretty trivial change in every single
>     file. Likewise, inexact rename detection might load the data for
>     many files all at once. But you'd need not just 64k changes, but
>     that many deletions and additions. The most likely candidate is
>     perhaps break-detection, which would load the data for all pairs and
>     keep it around for the content-level diff. But again, you'd need 64k
>     actually changed files in the first place.
>
>     So it's still possible to trigger this case, but it seems like "I
>     accidentally made all my files stat-dirty" is the most likely case
>     in the real world.
>
> Reported-by: Jan Christoph Uhde <Jan@xxxxxxxxxx>
> Signed-off-by: Jeff King <peff@xxxxxxxx>
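
The approach sounds good to me. For readers who don't have the patch
itself handy (only the commit message is quoted above), my understanding
is that the early-return path in diff_change() now frees the blob data it
just loaded for the stat-unmatch check, presumably via the existing
diff_free_filespec_data() helper. The lines below are my own sketch of
that spot, with the surrounding context paraphrased from memory rather
than copied from the actual hunk:

	/* Tail of diff_change() in diff.c; context is approximate. */
	if (options->flags.quick && options->skip_stat_unmatch &&
	    !diff_filespec_check_stat_unmatch(options->repo, p)) {
		/*
		 * The pair is only stat-dirty; nothing downstream will
		 * look at its contents again, so drop the mmap'd
		 * worktree file and the in-heap blob now rather than
		 * carrying them for the rest of the diff.
		 */
		diff_free_filespec_data(p->one);
		diff_free_filespec_data(p->two);
		return;
	}

	options->flags.has_changes = 1;

The filepair itself stays in the queue, so the later
diffcore_skip_stat_unmatch() pass can still find, count, and discard it
exactly as described in the notes above.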