On Fri, Sep 29, 2017 at 12:04:56PM -0700, Jonathan Tan wrote: > > So depending how you count it, we're wasting between 28% (sha1 and no > > extra hash) and 16% (sha256 plus reusing hashmap). That's not great, but > > it's probably not breaking the bank. > > Hmm...how did you get the 16% figure? The values, as I see it, are: > - 32 for the sha256 hash itself > - 8 for the "next" chain pointer > - 8 for the padded hash > - 8 for the "util" pointer > > For an oidset, the padded hash and the "util" pointer are wasted, which is > 16/56=0.286. (If you assume, say, 8 bytes of malloc bookkeeping overhead, then > 16/64=0.25.) Sorry to be unclear. I was just counting the waste for the "util" pointer, not the extra padded hash bit (which AFAIK is already wasted in oidset). So I computed 48 bytes without the util pointer, which means we waste an additional 16% to add it. Anyway, my point was mostly to say that this is a fractional percentage of the total memory. So it falls into the category of "this might help in tight situations" and less "this will blow up in our faces". > In a 100-million-object monorepo, we will probably end up only operating on the > "frontier" objects (objects that we do not have but we know about because some > object we have reference them) at the worst. I don't have numbers but I think > that that is at least an order of magnitude less than 100M. Agreed. > > So I think we may be better off going with the solution here that's > > simpler and requires introducing less code. If it does turn out to be a > > memory problem in the future, this is a _really_ easy thing to optimize > > after the fact, because we have these nice abstractions. > > Optimizing away the padded hash should be straightforward. Optimizing away the > "util" pointer after we have code that uses it is (slightly?) less > straightforward, but still doable. I was thinking here of just oidset. It has a limited API, so swapping out the implementation for one that does not depend on oidmap and waste the "util" pointer would be the "easy" part. > I still lean towards saving memory by eliminating the padded hash and > not using the "util" pointer, because the complication is contained > within only two files (oidmap.{c,h}), and because the constant factors > in memory savings might end up mattering. But I don't feel strongly > about that - I think any of the oidmaps that we have discussed is an > improvement over what we have now. My main concern is not so much about complexity bleeding out of the oidmap code, but that now we have two parallel near-identical hashmap implementations. When one gets an optimization, bugfix, or new feature, we have to port it over manually to the other. -Peff