Re: [PATCH v2 04/10] hashmap: introduce a new hashmap_partial_clear()

Elijah Newren <newren@xxxxxxxxx> · Fri, 30 Oct 2020 09:03:38 -0700

On Fri, Oct 30, 2020 at 6:41 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Tue, Oct 13, 2020 at 12:40:44AM +0000, Elijah Newren via GitGitGadget wrote:
>
> > merge-ort is a heavy user of strmaps, which are built on hashmap.[ch].
> > reset_maps() in merge-ort was taking about 12% of overall runtime in my
> > testcase involving rebasing 35 patches of linux.git across a big rename.
> > reset_maps() was calling hashmap_free() followed by hashmap_init(),
> > meaning that not only was it freeing all the memory associated with each
> > of the strmaps just to immediately allocate a new array again, it was
> > allocating a new array that wasy likely smaller than needed (thus
>
> s/wasy/was/

Thanks; will fix.

> > resulting in later need to rehash things).  The ending size of the map
> > table on the previous commit was likely almost perfectly sized for the
> > next commit we wanted to pick, and not dropping and reallocating the
> > table immediately is a win.
> >
> > Add some new API to hashmap to clear a hashmap of entries without
> > freeing map->table (and instead only zeroing it out like alloc_table()
> > would do, along with zeroing the count of items in the table and the
> > shrink_at field).
>
> This seems like a reasonable optimization to make, and doesn't make the
> API significantly more complicated. I'd expect the allocation of actual
> entry objects to dwarf the table allocation, but I guess:
>
>   - you'll deal with the individual entries later using a mempool
>
>   - it's not just the allocation, but the re-insertion of the entries as
>     we grow
>
> It would be nice if we had some actual perf numbers to report here, so
> we could know exactly how much it was buying us. But I guess things are
> a bit out-of-order there. You want to do this series first and then
> build merge-ort on top as a user. We could introduce the basic data
> structure first, then merge-ort, and then start applying optimizations
> with real-world measurements. But I'm not sure it's worth the amount of
> time you'd have to spend to reorganize in that way.

Yeah, the perf benefits didn't really come until I added a
strmap_clear() based on this, so as you discovered I put perf numbers
in patch 7 of this series.  Should I add a mention of the later commit
message at this point in the series?

> >  hashmap.c | 39 +++++++++++++++++++++++++++------------
> >  hashmap.h | 13 ++++++++++++-
>
> The implementation itself looks correct to me. I already mentioned my
> thoughts on naming in patch 1.

I'll circle back to that when I comment on patch 1...