On 12/22/2019 4:30 AM, Jeff King wrote: > On Fri, Dec 20, 2019 at 10:05:11PM +0000, Garima Singh via GitGitGadget wrote: > >> Adopting changed path bloom filters has been discussed on the list before, >> and a prototype version was worked on by SZEDER Gábor, Jonathan Tan and Dr. >> Derrick Stolee [1]. This series is based on Dr. Stolee's approach [2] and >> presents an updated and more polished RFC version of the feature. > > Great to see progress here. I probably won't have time to review this > carefully before the new year, but I did notice some low-hanging fruit > on the generation side. > > So here are a few patches to reduce the CPU and memory usage. They could > be squashed in at the appropriate spots, or perhaps taken as inspiration > if there are better solutions (especially for the first one). I tested these patches with the Linux kernel repo and reported my results on each patch. However, I wanted to also test on a larger internal repo (the AzureDevOps repo), which has ~500 commits with more than 512 changes, and generally has larger diffs than the Linux kernel repo. | Version | Time | Memory | |----------|--------|--------| | Garima | 16m36s | 27.0gb | | Peff 1 | 6m32s | 28.0gb | | Peff 2 | 6m48s | 5.6gb | | Peff 3 | 6m14s | 4.5gb | | Shortcut | 3m47s | 4.5gb | For reference, I found the time and memory information using "/usr/bin/time --verbose" in a bash script. > I think we could go further still, by actually doing a non-recursive > diff_tree_oid(), and then recursing into sub-trees ourselves. That would > save us having to split apart each path to add the leading paths to the > hashmap (most of which will be duplicates if the commit touched "a/b/c" > and "a/b/d", etc). I doubt it would be that huge a speedup though. We > have to keep a list of the touched paths anyway (since the bloom key > parameters depend on the number of entries), and most of the time is > almost certainly spent inflating the trees in the first place. However > it might be easier to follow the code, and it would make it simpler to > stop traversing at the 512-entry limit, rather than generating a huge > diff only to throw it away. By "Shortcut" in the table above, I mean the following patch on top of Garima's and Peff's changes. It inserts a max_changes option into struct diff_options to halt the diff early. This seemed like an easier change than creating a new tree diff algorithm wholesale. Thanks, -Stolee -->8-- From: Derrick Stolee <dstolee@xxxxxxxxxxxxx> Date: Fri, 27 Dec 2019 10:13:48 -0500 Subject: [PATCH] diff: halt tree-diff early after max_changes When computing the changed-paths bloom filters for the commit-graph, we limit the size of the filter by restricting the number of paths in the diff. Instead of computing a large diff and then ignoring the result, it is better to halt the diff computation early. Create a new "max_changes" option in struct diff_options. If non-zero, then halt the diff computation after discovering strictly more changed paths. This includes paths corresponding to trees that change. Use this max_changes option in the bloom filter calculations. This reduces the time taken to compute the filters for the Linux kernel repo from 2m50s to 2m35s. For a larger repo with more commits changing many paths, the time reduces from 6 minutes to under 4 minutes. Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> --- bloom.c | 4 +++- diff.h | 5 +++++ tree-diff.c | 5 +++++ 3 files changed, 13 insertions(+), 1 deletion(-) diff --git a/bloom.c b/bloom.c index ea77631cc2..83dde2378b 100644 --- a/bloom.c +++ b/bloom.c @@ -155,6 +155,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, struct bloom_filter_settings settings = DEFAULT_BLOOM_FILTER_SETTINGS; int i; struct diff_options diffopt; + int max_changes = 512; filter = bloom_filter_slab_at(&bloom_filters, c); @@ -171,6 +172,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, repo_diff_setup(r, &diffopt); diffopt.flags.recursive = 1; + diffopt.max_changes = max_changes; diff_setup_done(&diffopt); if (c->parents) @@ -179,7 +181,7 @@ struct bloom_filter *get_bloom_filter(struct repository *r, diff_tree_oid(NULL, &c->object.oid, "", &diffopt); diffcore_std(&diffopt); - if (diff_queued_diff.nr <= 512) { + if (diff_queued_diff.nr <= max_changes) { struct hashmap pathmap; struct pathmap_hash_entry* e; struct hashmap_iter iter; diff --git a/diff.h b/diff.h index 6febe7e365..9443dc1b00 100644 --- a/diff.h +++ b/diff.h @@ -285,6 +285,11 @@ struct diff_options { /* Number of hexdigits to abbreviate raw format output to. */ int abbrev; + /* If non-zero, then stop computing after this many changes. */ + int max_changes; + /* For internal use only. */ + int num_changes; + int ita_invisible_in_index; /* white-space error highlighting */ #define WSEH_NEW (1<<12) diff --git a/tree-diff.c b/tree-diff.c index 33ded7f8b3..16a21d9f34 100644 --- a/tree-diff.c +++ b/tree-diff.c @@ -434,6 +434,9 @@ static struct combine_diff_path *ll_diff_tree_paths( if (diff_can_quit_early(opt)) break; + if (opt->max_changes && opt->num_changes > opt->max_changes) + break; + if (opt->pathspec.nr) { skip_uninteresting(&t, base, opt); for (i = 0; i < nparent; i++) @@ -518,6 +521,7 @@ static struct combine_diff_path *ll_diff_tree_paths( /* t↓ */ update_tree_entry(&t); + opt->num_changes++; } /* t > p[imin] */ @@ -535,6 +539,7 @@ static struct combine_diff_path *ll_diff_tree_paths( skip_emit_tp: /* ∀ pi=p[imin] pi↓ */ update_tp_entries(tp, nparent); + opt->num_changes++; } } -- 2.25.0.rc0