On Thu, Sep 28, 2017 at 10:46:16AM -0700, Jonathan Tan wrote:

> > To me it seems like a much simpler API for a map would be to just allow
> > callers to store a 'void *' as the value.
>
> I agree that the API would be simpler.
>
> My main motivation with this design is indeed to save memory, and not
> inconvenience the user too much (in the case where you're storing things
> larger than one pointer, you just need to remember to put the special
> struct at the beginning of your struct), but if memory is not so
> important, I agree that we can switch to the "util" design.

When I saw that you were implementing "oidset" in terms of "oidmap", I
was all ready to be crabby about this extra memory. But then I saw that
the implementation tries hard not to waste any memory. :)

All of which is to say I gave this some thought when I was in the "ready
to be crabby" phase, and came to the conclusion that it probably isn't
that painful.

An unused pointer is 8 bytes per entry. We're already spending 20 for
the oid itself (which is likely to grow to 32 eventually), plus 8 for
the chained "next" pointer. Plus potentially 8 for a padded version of
the hash, if we just use a straight hashmap that duplicates the hash
field.

So depending how you count it, we're wasting between 28% (sha1 and no
extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but
it's probably not breaking the bank.

Another way of thinking about it. Large-ish (but not insane) repos have
on the order of 5-10 million objects. If we had an oidset that mentioned
every single object in the repository, that's 40-80MB wasted in the
worst case. For current uses of oidset, that's probably fine. It's
generally used only to collect ref tips (so probably two orders of
magnitude less).

If you're planning on using an oidset to mark every object in a
100-million-object monorepo, we'd probably care more. But I'd venture to
say that any scheme which involves generating that hash table on the fly
is doing it wrong.
At that scale we'd want to look at compact mmap-able on-disk
representations.

So I think we may be better off going with the solution here that's
simpler and requires introducing less code. If it does turn out to be a
memory problem in the future, this is a _really_ easy thing to optimize
after the fact, because we have these nice abstractions.

-Peff
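
For anyone who wants to check the numbers above, here is a rough sketch
of the arithmetic in C. The byte sizes are assumptions for a typical
64-bit platform, and the names (entry_bytes, overhead_percent,
wasted_bytes, print_estimates) are made up for illustration; none of
this is taken from git's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed sizes on a typical 64-bit platform (not read from git headers). */
enum {
	PTR_BYTES = 8,         /* one pointer: the unused value slot, or "next" */
	SHA1_BYTES = 20,
	SHA256_BYTES = 32,
	PADDED_HASH_BYTES = 8  /* hash field duplicated by a plain hashmap */
};

/* Bytes an entry needs before the unused value pointer is added. */
static size_t entry_bytes(size_t oid_bytes, int duplicate_hash)
{
	return oid_bytes + PTR_BYTES + (duplicate_hash ? PADDED_HASH_BYTES : 0);
}

/* The unused pointer as a percentage of the entry it is added to. */
static double overhead_percent(size_t base_bytes)
{
	assert(base_bytes > 0);
	return 100.0 * PTR_BYTES / base_bytes;
}

/* Total bytes spent on unused pointers across nr_objects entries. */
static size_t wasted_bytes(size_t nr_objects)
{
	return nr_objects * PTR_BYTES;
}

/* Print the two ends of the range plus the whole-repo estimates. */
static void print_estimates(void)
{
	printf("sha1, no extra hash:    %.1f%%\n",
	       overhead_percent(entry_bytes(SHA1_BYTES, 0)));
	printf("sha256, hashmap reused: %.1f%%\n",
	       overhead_percent(entry_bytes(SHA256_BYTES, 1)));
	printf("5M objects:  %zu bytes\n", wasted_bytes(5000000));
	printf("10M objects: %zu bytes\n", wasted_bytes(10000000));
}
```

Calling print_estimates() from a main() prints both percentages and
the byte totals. The percentages are relative to the entry size without
the extra pointer (28 and 48 bytes), which is one of several reasonable
ways to count it.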