On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote:

> >> +	memcpy(&sha_core, oid->hash, sizeof(uint64_t));
> >> +	rl->hash += sha_core;
> >
> > Hmm, so the first 64-bits of the oid of each ref that is part of
> > this island is added together as a 'hash' for the island. And this
> > is used to de-duplicate the islands? Any false positives? (does it
> > matter - it would only affect performance, not correctness, right?)
>
> I would think that a false positive from pure chance is very unlikely.
> We would need to approach billions of delta islands (as 2 to the power
> 64/2 is in the order of billions) for the probability to be
> significant. GitHub has less than 50 million users, and it is very
> unlikely that a significant proportion of these users will fork the
> same repo.
>
> Now if there is a false positive because two forks have exactly the
> same refs, then it is not a problem if they are considered the same,
> because they are actually the same.

Right, the idea is to find such same-ref setups to avoid spending a
pointless bit in the per-object bitmap. In the GitHub setup, it would
be an indication that two people forked at exactly the same time, so
they have the same refs and the same delta requirements. If one of them
later updates, that relationship would change at the next repack.

I don't know that we ever collected numbers for how often this happens,
so let me see if I can dig some up.

On our git/git repository network, it looks like we have ~14k forks,
and ~4k are unique by this hashing scheme. So it really is saving us
~10k bits per bitmap. That's over 1KB per object in the worst case.
There are ~24M objects (many times what is in git.git, but people push
lots of random things to their forks), so that's saving us up to 24GB
of RAM. Of course it almost certainly isn't that helpful in practice,
since we copy-on-write the bitmaps to avoid paying the full cost per
object. But I think it's fair to say it is helping (more numbers
below).

The distribution of buckets itself looks pretty power-law-ish from
eyeballing it. The biggest one has 92 forks, then it trails off quickly
to smaller numbers, ending in a long tail of singletons.

The numbers for torvalds/linux are similar: ~23k forks, only ~8k of
which are unique. rails/rails has 12k unique out of 17k forks. So
there's some fluctuation, but the notion that there will be a bunch of
identical forks seems solid.

If you're curious what the actual memory usage is for a repack of
git/git, here are some rough numbers from eyeballing RSS via "top"
while watching the repack progress meter (I didn't want to do a full
valgrind run because it's S-L-O-W):

  - after loading the islands, we're at ~3GB RSS. Some of that is
    shared (the mmap'd packed-refs file is 1GB by itself), but probably
    1-1.5GB is real heap, likely from the gigantic fork list.

  - we grow to around 6GB RSS before starting "counting objects". I
    think this is largely prepare_revision_walk(), since we're making
    "struct commit" objects for all the ref tips (and we're mmap-ing
    packfiles, too, which counts against our RSS).

  - by the time "counting objects" is done, we're at ~10GB. We're not
    yet running with Duy's memory-slimming for pack-objects, so it's
    100+ bytes per object_entry. Which means we're probably only
    spending a GB or so on island bitmaps, but at this point we've only
    marked commits and root-trees.

  - by the time "propagating island marks" is done, we'll have marked
    all the trees and blobs, and RSS is up to ~11GB.
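To make the hashing idea from the top of the thread concrete, here is a
rough, self-contained sketch. The names (a bare 20-byte SHA-1 in
"struct object_id", island_hash()) are simplified stand-ins for
illustration, not the actual delta-islands code; the point is just that
summing the first 64 bits of each ref's oid gives identical hashes for
identical ref sets, which is what the de-duplication relies on:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for git's object id; SHA-1 only. */
#define HASH_RAW_SZ 20
struct object_id {
	unsigned char hash[HASH_RAW_SZ];
};

/*
 * Fold the first 64 bits of each ref tip's oid into one sum. The value
 * read by memcpy depends on host endianness, but it is consistent
 * within a single process, which is all de-duplication needs.
 */
static uint64_t island_hash(const struct object_id *oids, size_t nr)
{
	uint64_t h = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		uint64_t core;
		memcpy(&core, oids[i].hash, sizeof(core));
		h += core;
	}
	return h;
}

int main(void)
{
	/* two "forks" whose refs point at the same two objects... */
	struct object_id a[2] = { { { 0xab, 0xcd } }, { { 0x12, 0x34 } } };
	struct object_id b[2] = { { { 0x12, 0x34 } }, { { 0xab, 0xcd } } };

	/* ...hash identically, even with the refs in a different order */
	printf("same island hash: %d\n",
	       island_hash(a, 2) == island_hash(b, 2));
	return 0;
}

And as the quoted exchange notes, a chance collision between two
genuinely different islands should only affect pack quality and
performance, not correctness.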
So all in all (and I'd emphasize this is extremely rough) I think it
probably costs about 2GB for the feature in this particular case. But
you need much more than that to repack at this size sanely anyway.

If I were doing this patch series from scratch, I might implement the
fork de-duping and copy-on-write bits successively so I could take good
measurements of how much they help (and brag about it in the commit
messages). But I don't know if it's worth untangling now or not.

-Peff