Re: [PATCH v4 1/7] Add delta-islands.{c,h}

On Mon, Aug 13, 2018 at 9:00 PM, Jeff King <peff@xxxxxxxx> wrote:
> On Mon, Aug 13, 2018 at 05:33:59AM +0200, Christian Couder wrote:
>
>> >> +     memcpy(&sha_core, oid->hash, sizeof(uint64_t));
>> >> +     rl->hash += sha_core;
>> >
>> > Hmm, so the first 64-bits of the oid of each ref that is part of
>> > this island is added together as a 'hash' for the island. And this
>> > is used to de-duplicate the islands? Any false positives? (does it
>> > matter - it would only affect performance, not correctness, right?)
>>
>> I would think that a false positive from pure chance is very unlikely.
>> We would need to approach billions of delta islands (as 2 to the power
>> 64/2 is on the order of billions) for the probability to be
>> significant. GitHub has fewer than 50 million users and it is very
>> unlikely that a significant proportion of these users will fork the
>> same repo.
>>
>> Now if there is a false positive because two forks have exactly the
>> same refs, then it is not a problem if they are considered the same,
>> because they are actually the same.
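
For reference, the scheme discussed above can be sketched in isolation.
This is only a rough illustration, not the code from the patch:
sketch_oid, sketch_island, island_add_ref() and
approx_collision_probability() are made-up names, and a 20-byte hash is
assumed. It shows that summing the first 64 bits of each ref's oid gives
the same value whatever order the refs are seen in, and it gives the
usual small-probability birthday estimate for n islands.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_HASH_LEN 20 /* assumption: SHA-1 sized object ids */

/* Hypothetical stand-ins for the real structures in the patch. */
struct sketch_oid {
	unsigned char hash[SKETCH_HASH_LEN];
};

struct sketch_island {
	uint64_t hash; /* running sum over all refs in the island */
};

/*
 * Add the first 64 bits of the ref's oid to the island hash.
 * Unsigned addition wraps around and is commutative, so two islands
 * with the same set of refs get the same hash in any order.
 */
static void island_add_ref(struct sketch_island *rl,
			   const struct sketch_oid *oid)
{
	uint64_t sha_core;

	memcpy(&sha_core, oid->hash, sizeof(uint64_t));
	rl->hash += sha_core;
}

/*
 * Birthday-style estimate n * (n - 1) / 2 / 2^64 for the chance that
 * any two of n distinct islands share a 64-bit hash. Only meaningful
 * while the result is much smaller than 1.
 */
static double approx_collision_probability(double n)
{
	return n * (n - 1.0) / 2.0 / 18446744073709551616.0; /* 2^64 */
}
```

Calling approx_collision_probability(14000.0) gives a value far below
one in a billion, while the estimate only becomes significant once n
approaches 2^32, which is where the "billions" above comes from.
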
>
> Right, the idea is to find such same-ref setups to avoid spending a
> pointless bit in the per-object bitmap. In the GitHub setup, it would be
> an indication that two people forked at exactly the same time, so they
> have the same refs and the same delta requirements. If one of them later
> updates, that relationship would change at the next repack.
>
> I don't know that we ever collected numbers for how often this happens.
> So let me see if I can dig some up.
>
> On our git/git repository network, it looks like we have ~14k forks, and
> ~4k are unique by this hashing scheme. So it really is saving us
> 10k-bits per bitmap. That's over 1k-byte per object in the worst case.
> There are ~24M objects (many times what is in git.git, but people push
> lots of random things to their forks), so that's saving us up to 24GB in
> RAM. Of course it almost certainly isn't that helpful in practice, since
> we copy-on-write the bitmaps to avoid the full cost per object. But I
> think it's fair to say it is helping (more numbers below).
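
The worst-case arithmetic above can be spelled out as a small sketch.
The function names are hypothetical, and the inputs are the rounded
figures from the message (~14k forks, ~4k unique islands, ~24M objects).
It assumes one bit per island in each per-object bitmap and ignores the
copy-on-write sharing mentioned above, so it is an upper bound only.

```c
#include <stdint.h>

/* One bit is saved for every fork whose island duplicates another. */
static uint64_t bits_saved_per_bitmap(uint64_t forks, uint64_t unique)
{
	return forks - unique;
}

/*
 * Upper bound on bytes saved if every object carried its own full
 * bitmap (no copy-on-write sharing between objects).
 */
static uint64_t worst_case_bytes_saved(uint64_t forks, uint64_t unique,
				       uint64_t objects)
{
	uint64_t bits = bits_saved_per_bitmap(forks, unique);
	uint64_t bytes_per_object = (bits + 7) / 8; /* round up to bytes */

	return bytes_per_object * objects;
}
```

With 14000 forks, 4000 unique islands and 24000000 objects this gives
10000 bits, or 1250 bytes, per object and 30000000000 bytes in total.
Rounding 1250 bytes down to 1k-byte per object gives the "up to 24GB"
figure quoted above.
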

[...]

> So all in all (and I'd emphasize this is extremely rough) I think it
> probably costs about 2GB for the feature in this particular case. But
> you need much more to repack at this size sanely anyway.

Thanks for the interesting numbers!


