On Mon, Aug 13, 2018 at 9:00 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote:
>
>> >> +       memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>> >> +       rl->hash += sha_core;
>> >
>> > Hmm, so the first 64-bits of the oid of each ref that is part of
>> > this island is added together as a 'hash' for the island. And this
>> > is used to de-duplicate the islands? Any false positives? (does it
>> > matter - it would only affect performance, not correctness, right?)
>>
>> I would think that a false positive from pure chance is very unlikely.
>> We would need to approach billions of delta islands (as 2 to the power
>> 64/2 is in the order of billions) for the probability to be
>> significant. GitHub has less than 50 millions users and it is very
>> unlikely that a significant proportion of these users will fork the
>> same repo.
>>
>> Now if there is a false positive because two forks have exactly the
>> same refs, then it is not a problem if they are considered the same,
>> because they are actually the same.
>
> Right, the idea is to find such same-ref setups to avoid spending a
> pointless bit in the per-object bitmap. In the GitHub setup, it would be
> an indication that two people forked at exactly the same time, so they
> have the same refs and the same delta requirements. If one of them later
> updates, that relationship would change at the next repack.
>
> I don't know that we ever collected numbers for how often this happens.
> So let me see if I can dig some up.
>
> On our git/git repository network, it looks like we have ~14k forks, and
> ~4k are unique by this hashing scheme. So it really is saving us
> 10k-bits per bitmap. That's over 1k-byte per object in the worst case.
> There are ~24M objects (many times what is in git.git, but people push
> lots of random things to their forks), so that's saving us up to 24GB in
> RAM. Of course it almost certainly isn't that helpful in practice, since
> we copy-on-write the bitmaps to avoid the full cost per object. But I
> think it's fair to say it is helping (more numbers below).

[...]

> So all in all (and I'd emphasize this is extremely rough) I think it
> probably costs about 2GB for the feature in this particular case. But
> you need much more to repack at this size sanely anyway.

Thanks for the interesting numbers!
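
To make the scheme discussed above concrete, here is a minimal standalone
sketch of the per-island hash: the first 64 bits of each ref's oid are
added into a running sum, as in the two quoted patch lines. This is only
an illustration, not the patch itself. The names `sketch_oid`,
`sketch_island`, `island_add_ref` and `island_hash` are hypothetical, and
the 20-byte oid size is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_RAWSZ 20 /* assumed raw oid size (SHA-1) for this sketch */

/* Hypothetical stand-in for an object id; not git's real struct. */
struct sketch_oid {
	unsigned char hash[SKETCH_RAWSZ];
};

/* Hypothetical stand-in for the per-island state ("rl" in the patch). */
struct sketch_island {
	uint64_t hash;
};

/* Fold one ref tip into the island hash, mirroring the quoted lines. */
static void island_add_ref(struct sketch_island *rl,
			   const struct sketch_oid *oid)
{
	uint64_t sha_core;

	/* memcpy avoids an unaligned read of the first 8 bytes. */
	memcpy(&sha_core, oid->hash, sizeof(uint64_t));
	/* Unsigned addition wraps modulo 2^64, which is well defined. */
	rl->hash += sha_core;
}

/* Hash a whole set of ref tips; the result ignores their order. */
static uint64_t island_hash(const struct sketch_oid *oids, size_t nr)
{
	struct sketch_island rl = { 0 };
	size_t i;

	assert(oids || !nr);
	for (i = 0; i < nr; i++)
		island_add_ref(&rl, &oids[i]);
	return rl.hash;
}
```

Since addition is commutative, two forks whose refs point at the same set
of objects get the same value whatever order the refs are visited in,
which is the property the de-duplication relies on. A collision between
different ref sets would only cause two islands to share a bitmap bit.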