On Wed, Sep 14, 2016 at 05:52:06PM -0700, Stefan Beller wrote:

> > So finding "1011" involves traversing the trie: down the "1"
> > side, then the "0" side, and then check that the rest
> > matches "11".
>
> So we stop building a tree as soon as we hit a unique data
> element (i.e. if we stick to the idea of encoding the hash along
> the way, we would have needed another node each for "11" as well
> as "00" that point to NULL and the ending data, respectively).
>
> So we stop short as soon as we have a unique.
>
> That makes insertion very easy, because as soon as we hit
> a unique, we'd just introduce a node and add the two uniques
> left and right. (Well, what happens if we were to insert
> 101010111 and 101010101? Both have a long common prefix;
> I suppose we'd have 7 nodes and then the distinguishing
> node for those 2.)

I think it actually can represent the interior sequence as a single
node. That's what the "critical" part of critbit is for; the critical
bit is the one where we diverge. But I'm not 100% sure that it's
implemented that way. In practice, though, sha1 would give us a pretty
full tree near the front, so I'd expect each bit to be "critical".
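Roughly, the usual critbit shape is something like this (a sketch from
memory, not code from any of the patches in this thread; a real
implementation tags pointers instead of spending a whole word on an
is_leaf flag, but the explicit flag keeps the sketch self-contained):

struct critbit_node {
        int is_leaf;
        union {
                struct {
                        unsigned int crit;   /* first differing bit */
                        struct critbit_node *child[2];
                } inner;
                const unsigned char *key;    /* full key at a leaf */
        } u;
};

static const unsigned char *critbit_lookup(struct critbit_node *n,
                                           const unsigned char *key)
{
        while (!n->is_leaf) {
                /* examine only the critical bit; the bits in
                 * between are skipped entirely */
                int bit = (key[n->u.inner.crit / 8] >>
                           (7 - n->u.inner.crit % 8)) & 1;
                n = n->u.inner.child[bit];
        }
        /* whole prefixes were skipped on the way down, so the
         * caller still has to compare this leaf against the full
         * key (your "check that the rest matches" step) */
        return n->u.key;
}

So for 101010111 and 101010101, the seven shared bits cost a single
internal node with crit = 7, not seven nodes.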
> > It's possible that (2) would be better if instead of a
> > critbit tree, we used a "critbyte" tree. That would involve
> > fewer node traversals, at the cost of making each node
> > larger (right now the internal nodes store 2 pointer slots;
> > they'd have to store 256 to handle a full byte). I didn't
> > try it, but I suspect it would still be slower for the same
> > reasons.
>
> I would expect to go for a crit-"variable-length" tree instead.
>
> The reason for this is that a higher fan-out seems to be more
> beneficial in the earlier stages; e.g. we could use critbyte trees
> for the first 1-3 levels of the tree, where memory efficiency is
> good (all 256 slots filled) and they are faster than critbit trees
> (one lookup instead of 8 conditional jumps).

Yeah, I suspect a hybrid approach may be a better tradeoff. I encourage
you to try it and measure. :)

> I guess when trying to improve the hashsets, someone tried trees
> as a collision resolver?

I don't think so. hashmap.c uses a linked list, but obj_hash just does
linear probing. Both are linear, but the latter is more memory-efficient
and especially has a more compact cache footprint when resolving
collisions. The downside of open probing like that is that it is very
hard to delete an entry, but we don't care about that for obj_hash.

I don't think that improving collision resolution would help that much
for obj_hash, though. The fact that quadratic probing and cuckoo
hashing didn't yield big benefits implies that we don't spend most of
our time on collisions in the first place (OTOH, my 9a414486d9f0 that
you found does help _only_ when there are collisions, so maybe I'm
wrong).

> So maybe we could have some software-side cache for hot entries?
> (I imagine a data structure consisting of 2 hash sets here: one
> hashset containing the complete data set, and a very small hashset
> with e.g. just 256(?) entries that serves as an LRU cache for the
> hot entries. Though this sounds like we'd be trying to outsmart the
> hardware... Not sure I'd expect gains from that.)

Yeah. Basically what we want is the reverse of a bloom filter: it's OK
to be wrong, but only in the direction of saying "it's not here" when
it is, never the other way around. And so that's basically...a cache of
a smaller data structure in front of the larger one (there's a rough
sketch of the shape at the end of this mail).

But given that the hash table is already O(1)-ish, it's hard to beat
(because remember that when you have a cache miss, you then have to do
an extra lookup in the full table anyway). I did play around with stuff
like that when I was coming up with 9a414486d9f0, but was never able to
make it faster. Patches welcome, of course.

> I guess we rather want to split up the data sets on the application
> side: i.e. instead of having so many objects, have hash sets for
> e.g. blobs, trees, and commits separately, and then use slightly
> different strategies there (different load factors?)

My gut is that it would not make a big difference (and sometimes it
would be worse, because we don't know what the object type is, or we
want to make _sure_ that we don't have the object known as a different
type).

> Unrelated note about hashmaps:
> I wonder if we rather want to give good initial estimates of the
> size as one improvement

My recollection is that the growth isn't really relevant. At least for
obj_hash, we do _way_ more lookups than insertions, so the only thing
that really matters is lookup speed. But as with everything, if you can
come up with improved numbers, I'm happy to look at the patches. :)
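Just to make the "cache in front" shape concrete, something like this
(hypothetical code, not the actual patch I played with back then;
cached_lookup_object() and front_cache are made up, lookup_object()
and hashcmp() are the real thing):

#define FRONT_CACHE_SIZE 256

static struct object *front_cache[FRONT_CACHE_SIZE];

static struct object *cached_lookup_object(const unsigned char *sha1)
{
        /* sha1 is already well-distributed; any byte of it makes a
         * fine cache index */
        unsigned int slot = sha1[0] & (FRONT_CACHE_SIZE - 1);
        struct object *obj = front_cache[slot];

        /* a hit is verified, so the cache can only be "wrong" by
         * missing, never by returning the wrong object */
        if (obj && !hashcmp(obj->oid.hash, sha1))
                return obj;

        /* and here is the problem: every miss pays for the probe
         * above _plus_ the full lookup below */
        obj = lookup_object(sha1);
        if (obj)
                front_cache[slot] = obj;
        return obj;
}

Every miss costs both lookups, which is why I could never measure a
win from it.

-Peff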