On Fri, Dec 27, 2019 at 11:11:37AM -0500, Derrick Stolee wrote: > > So here are a few patches to reduce the CPU and memory usage. They could > > be squashed in at the appropriate spots, or perhaps taken as inspiration > > if there are better solutions (especially for the first one). > > I tested these patches with the Linux kernel repo and reported my results > on each patch. However, I wanted to also test on a larger internal repo > (the AzureDevOps repo), which has ~500 commits with more than 512 changes, > and generally has larger diffs than the Linux kernel repo. > > | Version | Time | Memory | > |----------|--------|--------| > | Garima | 16m36s | 27.0gb | > | Peff 1 | 6m32s | 28.0gb | > | Peff 2 | 6m48s | 5.6gb | > | Peff 3 | 6m14s | 4.5gb | > | Shortcut | 3m47s | 4.5gb | > > For reference, I found the time and memory information using > "/usr/bin/time --verbose" in a bash script. Thanks for giving it more exercise. My heap profiling was done with massif, which measures the heap directly. Measuring RSS would cover that, but will also include the mmap'd packfiles. That's probably why your linux.git numbers were slightly higher than mine. (massif is a really great tool if you haven't used it, as it also shows which allocations were using the memory. But it's part of valgrind, so it definitely doesn't run on native Windows. It might work under WSL, though. I'm sure there are also other heap profilers on Windows). > By "Shortcut" in the table above, I mean the following patch on top of > Garima's and Peff's changes. It inserts a max_changes option into struct > diff_options to halt the diff early. This seemed like an easier change > than creating a new tree diff algorithm wholesale. Yeah, I'm not opposed to a diff feature like this. But be careful, because... > diff --git a/diff.h b/diff.h > index 6febe7e365..9443dc1b00 100644 > --- a/diff.h > +++ b/diff.h > @@ -285,6 +285,11 @@ struct diff_options { > /* Number of hexdigits to abbreviate raw format output to. */ > int abbrev; > > + /* If non-zero, then stop computing after this many changes. */ > + int max_changes; > + /* For internal use only. */ > + int num_changes; This is holding internal state in diff_options, but the same diff_options is often used for multiple diffs (e.g., "git log --raw" would use the same rev_info.diffopt over and over again). So it would need to be cleared between diffs. There's a similar feature in the "has_changes" flag, though it looks like it is cleared manually by callers. Yuck. This isn't a problem for commit-graph right now, but: - it actually could be using a single diff_options, which would be slightly simpler (it doesn't seem to save much CPU, though, because the initialization is relatively cheap) - it's a bit of a subtle bug to leave hanging around for the next person who tries to use the feature I actually wonder if this could be rolled into the has_changes and diff_can_quit_early() feature. This really just a generalization of that feature (which is like setting max_changes to "1"). -Peff