On Fri, Aug 21, 2020 at 2:33 PM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Fri, Aug 21, 2020 at 1:16 PM Jeff King <peff@xxxxxxxx> wrote:
> >
> > On Fri, Aug 21, 2020 at 06:52:24PM +0000, Elijah Newren via GitGitGadget wrote:
> >
> > > Here I introduce a new strmap type, which my new merge backend, merge-ort,
> > > uses heavily. (I also made significant use of it in my changes to
> > > diffcore-rename). This strmap type was based on Peff's proposal from a
> > > couple years ago[1], but has additions that I made as I used it. I also
> > > start the series off with a quick documentation improvement to hashmap.c to
> > > differentiate between hashmap_free() and hashmap_free_entries(), since I
> > > personally had difficulty understanding them and it affects how
> > > strmap_clear()/strmap_free() are written.
> >
> > I like the direction overall (unsurprisingly), but left a bunch of
> > comments. I do think if we're going to do this that it may be worth
> > cleaning up hashmap a bit first, especially around its clear/free
> > semantics, and its ability to lazy-allocate the table.
> >
> > I'm happy to work on that, but don't want to step on your toes.
>
> I have patches which introduce hashmap_clear() and
> hashmap_clear_entries() to hashmap.[ch], which allowed me to simplify
> strmap_clear(); instead of needing to call both
> hashmap_free[_entries]() && strmap_init(), I could just call
> hashmap_clear[_entries](). Doing that surprised me with a significant
> performance impact (in a good direction), at which point I started
> adding mem-pool integration into hashmap for storing the entries that
> hashmap.c allocates and got further good speedups.
>
> I thought those were better explained when I got to the performance
> stuff, so I had held off on those patches. I could pull them out and
> submit them first.
>
> However, there's an important difference here between what I've done
> and what you've suggested for hashmap: my method did not deallocate
> hashmap->table in hashmap_clear() and then use lazy initialization.
> In fact, I think not deallocating the table was part of the charm --
> the table had already naturally grown to the right size, and because
> the repository has approximately the same number of paths in various
> commits, this provided me a way of getting a table preallocated to a
> reasonable size for all merges after the first (and there are multiple
> merges either when recursiveness is needed due to multiple merge
> bases, OR when rebasing or cherry-picking a sequence of commits).
> This prevented, as hashmap.h puts it, "expensive resizing".
>
> So, once again, my performance ideas might be clashing with some of
> your desires for the API. Any clever ideas for resolving that?
>
> Also, since you want to see hashmap cleanup first, should I submit
> just the hashmap_clear[_entries]() stuff, or should I also submit the
> API additions to allow mem-pool integration in hashmap (it's pretty
> small and self-contained, but it'll be a while before I submit the
> patches that use it...)?

Never mind, I misremembered. The mem-pool integration was added
specifically to strmap, not to hashmap, because strmap_put() does the
allocation of the str_entry. So I'll just pull out the
hashmap_clear[_entries]() stuff and send it up.

>
> > I also wonder if you looked at the khash stuff at all. Especially for
> > storing integers, it makes things much more natural. You'd do something
> > like:
> >
> >   /* you might even be able to just write !strcmp in the macro below */
> >   static inline int streq(const char *a, const char *b)
> >   {
> >           return !strcmp(a, b);
> >   }
> >
> >   KHASH_INIT(strint_map, char *, int, 1, strhash, streq);
> >
> > and then you'd probably want a "put" wrapper that makes a copy of the
> > string. khash has its own charming awkwardness, but I'm just curious if you
> > looked at it and found it more awkward than hashmap.c, or if you just
> > didn't look at it.
>
> I did look at it, but only briefly. I had a further investigation on
> my TODO list for months, along with several other improvement ideas.
> But it seemed like my TODO list was really long, and my new merge
> backend hasn't benefited anyone yet. At some point, I decided to punt
> on it and other ideas and start cleaning up my code and submitting. I
> believe merge-ort is more accurate than merge-recursive (it fixes
> several test_expect_failures) and is a lot faster as well for the
> cases I'm looking at. So, for now, I've pulled it off my radar.
>
> But I'd be really happy if someone else wanted to jump in and try
> switching out hashmap for khash in the strmap API and see if it helps
> merge-ort performance. :-)
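
To make the clear-versus-free trade-off discussed above concrete, here
is a minimal, self-contained sketch. It is not git's hashmap.c: every
name in it (toy_map, toy_put, toy_clear, toy_second_fill_resizes) is
hypothetical, and the growth policy is an assumption chosen only for
illustration. It fills a toy chained hash table, clears it, fills it
again with a similar number of keys, and counts how often the bucket
array had to be reallocated during the second fill, once with the
table kept across the clear and once with it freed and lazily
reallocated.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy map; none of these names exist in git's hashmap.[ch]. */
struct toy_entry {
	struct toy_entry *next;
	char *key;
	int value;
};

struct toy_map {
	struct toy_entry **table; /* bucket array, NULL until first put */
	size_t tablesize;
	size_t nr;                /* number of stored entries */
	unsigned resizes;         /* bucket array (re)allocations so far */
};

static size_t toy_hash(const char *s)
{
	size_t h = 5381;
	while (*s)
		h = h * 33 + (unsigned char)*s++;
	return h;
}

/* Allocate a bucket array of newsize and rehash all entries into it. */
static void toy_grow(struct toy_map *m, size_t newsize)
{
	struct toy_entry **nt = calloc(newsize, sizeof(*nt));
	size_t i;

	if (!nt)
		abort();
	for (i = 0; i < m->tablesize; i++) {
		struct toy_entry *e = m->table[i], *next;
		for (; e; e = next) {
			size_t b = toy_hash(e->key) % newsize;
			next = e->next;
			e->next = nt[b];
			nt[b] = e;
		}
	}
	free(m->table);
	m->table = nt;
	m->tablesize = newsize;
	m->resizes++;
}

static void toy_put(struct toy_map *m, const char *key, int value)
{
	struct toy_entry *e;
	size_t b, len = strlen(key) + 1;

	/* Assumed policy: double when full; also handles the lazy first allocation. */
	if (m->nr >= m->tablesize)
		toy_grow(m, m->tablesize ? m->tablesize * 2 : 8);
	b = toy_hash(key) % m->tablesize;
	for (e = m->table[b]; e; e = e->next) {
		if (!strcmp(e->key, key)) {
			e->value = value;
			return;
		}
	}
	e = malloc(sizeof(*e));
	if (!e || !(e->key = malloc(len)))
		abort();
	memcpy(e->key, key, len);
	e->value = value;
	e->next = m->table[b];
	m->table[b] = e;
	m->nr++;
}

/* Free all entries; keep or release the bucket array depending on keep_table. */
static void toy_clear(struct toy_map *m, int keep_table)
{
	size_t i;

	for (i = 0; i < m->tablesize; i++) {
		struct toy_entry *e = m->table[i], *next;
		for (; e; e = next) {
			next = e->next;
			free(e->key);
			free(e);
		}
		m->table[i] = NULL;
	}
	m->nr = 0;
	if (!keep_table) {
		free(m->table);
		m->table = NULL;
		m->tablesize = 0;
	}
}

static void toy_fill(struct toy_map *m, unsigned n)
{
	char key[64];
	unsigned i;

	for (i = 0; i < n; i++) {
		snprintf(key, sizeof(key), "dir%u/file%u.c", i % 37, i);
		toy_put(m, key, (int)i);
	}
}

/* Return how many bucket-array allocations the second fill needed. */
unsigned toy_second_fill_resizes(int keep_table)
{
	struct toy_map m = { NULL, 0, 0, 0 };
	unsigned before, during;

	toy_fill(&m, 1000);
	toy_clear(&m, keep_table);
	before = m.resizes;
	toy_fill(&m, 1000);
	assert(m.nr == 1000);
	during = m.resizes - before;
	toy_clear(&m, 0);
	return during;
}
```

In this toy, toy_second_fill_resizes(1) returns zero because the
second fill reuses a bucket array that has already grown to fit,
while toy_second_fill_resizes(0) has to grow the table from scratch
again. Whether the same effect explains the speedup seen with the
real hashmap_clear() patches is something only measurements on those
patches can confirm.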