Re: [PATCH 0/5] Add struct strmap and associated utility functions

Elijah Newren <newren@xxxxxxxxx> · Fri, 28 Aug 2020 08:29:44 -0700

On Fri, Aug 28, 2020 at 12:03 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Fri, Aug 21, 2020 at 02:33:54PM -0700, Elijah Newren wrote:
>
> > However, there's an important difference here between what I've done
> > and what you've suggested for hashmap: my method did not deallocate
> > hashmap->table in hashmap_clear() and then use lazy initialization.
> > In fact, I think not deallocating the table was part of the charm --
> > the table had already naturally grown to the right size, and because
> > the repository has approximately the same number of paths in various
> > commits, this provided me a way of getting a table preallocated to a
> > reasonable size for all merges after the first (and there are multiple
> > merges either when recursiveness is needed due to multiple merge
> > bases, OR when rebasing or cherry-picking a sequence of commits).
> > This prevented, as hashmap.h puts it, "expensive resizing".
> >
> > So, once again, my performance ideas might be clashing with some of
> > your desires for the API.  Any clever ideas for resolving that?
>
> If the magic is in pre-sizing the hash, then it seems like the callers
> ought to be feeding the size hint. That does make a little more work for
> them, but I think there's real value in having consistent semantics for
> "clear" across our data structures.

I thought about adding a size hint from the callers, but the thing is
I don't know how to get a good one short of running a merge and
querying how big things were sized in that merge.  (In some common
cases I can get an upper bound, but I can't get it in all cases and
that upper bound might be a couple orders of magnitude too big.)
Thus, it's really a case where I just punt on pre-sizing for the first
merge, and use the size from the previous merge for subsequent ones.
If you have a non-recursive merge or are cherry-picking only a single
commit, then no sizing hint is used.

> However, one cheat would be to free the memory but retain the size hint
> after a clear. And then if we lazy-init, grow immediately to the hint
> size. That's more expensive than a true reuse, because we do reallocate
> the memory. But it avoids the repeated re-allocation during growth.
>
> It may also be a sign that we should be growing the hash more
> aggressively in the first place. Of course all of this is predicated
> having some benchmarks. It would be useful to know which part actually
> provided the speedup.

Your thoughts here are great; I also had another one this past week --
I could introduce a hashmap_partial_clear() (in addition to
hashmap_clear()) for the special usecase I have of leaving the table
allocated and pre-sized.  It'd prevent people from accidentally using
it and forgetting to free stuff, while still allowing me to take
advantage.  But, as you say, more benchmarks would be useful to find
which parts provided the speedup before taking any of these steps.