Re: [PATCH v4 01/10] commit-graph: fix regression when computing Bloom filters

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> · Tue, 3 Nov 2020 11:06:29 +0530

On Sun, Oct 25, 2020 at 04:58:14PM -0400, Taylor Blau wrote:
> On Sun, Oct 25, 2020 at 01:16:48AM +0200, Jakub Narębski wrote:
> > "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> >
> > > While measuring performance with `git commit-graph write --reachable
> > > --changed-paths` on the linux repository led to around 1m40s for both
> > > HEAD and master (and could be due to fault in my measurements), it is
> > > still the "right" thing to do.
> >
> > I had to read the above paragraph several times to understand it,
> > possibly because I have expected here to be a fix for a performance
> > regression.  The commit message for 3d112755 (commit-graph: examine
> > commits by generation number) describes reduction of computation time
> > from 3m00s to 1m37s.  So I would expect performance with HEAD (i.e.
> > before those changes) to be around 3m, not the same before and after
> > changes being around 1m40s.
> >
> > Can anyone recheck this before-and-after benchmark, please?
> 
> My hunch is that our heuristic to fall back to the commits 'date'
> value is saving us here. commit_gen_cmp() first compares the generation
> numbers, breaking ties by 'date' as a heuristic. But since all
> generation number queries return GENERATION_NUMBER_INFINITY during
> writing, we're relying on our heuristic entirely.
> 
> I haven't looked much further than that, other than to see that I could
> get about a ~4sec speed-up with this patch as compared to v2.29.1 in the
> computing Bloom filters region on the kernel.
> 

Thanks for benchmarking it. I wasn't sure if I am testing it correctly
or the patch made no difference.

> > Anyway, it might be more clear to write it as the following:
> >
> >   On the Linux kernel repository, this patch didn't reduce the
> >   computation time for 'git commit-graph write --reachable
> >   --changed-paths', which is around 1m40s both before and after this
> >   change.  This could be a fault in my measurements; it is still the
> >   "right" thing to do.
> >
> > Or something like that.
> 
> Assuming that we are in fact being saved by the "date" heuristic, I'd
> probably write the following commit message instead:
> 
>   Before computing Bloom filters, the commit-graph machinery uses
>   commit_gen_cmp to sort commits by generation order for improved diff
>   performance. 3d11275505 (commit-graph: examine commits by generation
>   number, 2020-03-30) claims that this sort can reduce the time spent to
>   compute Bloom filters by nearly half.
> 
>   But since c49c82aa4c (commit: move members graph_pos, generation to a
>   slab, 2020-06-17), this optimization is broken, since asking for
>   'commit_graph_generation()' directly returns GENERATION_NUMBER_INFINITY
>   while writing.
> 
>   Not all hope is lost, though: 'commit_graph_generation()' falls
>   back to comparing commits by their date when they have equal generation
>   number, and so since c49c82aa4c is purely a date comparison function.
>   This heuristic is good enough that we don't seem to loose appreciable
>   performance while computing Bloom filters. [Benchmark that we loose
>   about ~4sec before/after c49c82aa4c9...]
> 
>   So, avoid the uesless 'commit_graph_generation()' while writing by
>   instead accessing the slab directly. This returns the newly-computed
>   generation numbers, and allows us to avoid the heuristic by directly
>   comparing generation numbers.
> 

That's a lot better, will change.

> Thanks,
> Taylor