Re: dangling commits and blobs: is this normal?

On Apr 22, 2009, at 16:05, Jeff King wrote:

> The other tradeoff, mentioned by Matthieu, is not about speed, but about
> rollover of files on disk. I think he would be in favor of a less
> optimal pack setup if it meant rewriting the largest packfile less
> frequently.
>
> However, it may be reasonable to suggest that he just not manually "gc"
> then. If he is not generating enough commits to warrant an auto-gc, then
> he is probably not losing much by having loose objects. And if he is,
> then auto-gc is already taking care of it.

For large repositories with lots of large files, git spends too much time
copying large packs for relatively little gain. This becomes obvious as soon
as a repository contains a few dozen large objects. Currently, there is no
limit to the number of times this data may be copied. In particular, the
average amount of I/O needed to store changes of size X grows linearly with
the size of the total repository. So, the mere presence of a couple of large
objects imposes a large overhead, spread over all later operations.
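
To make the linear dependence concrete, here is a rough back-of-the-envelope
sketch in Python; the repository size, change size and repack count are
made-up numbers, not measurements:

# Hypothetical figures: a repository whose packs total 2 GB,
# fully rewritten on every repack.
repo_size_mb   = 2048   # total packed size (assumed)
change_size_mb = 1      # new data added before each repack (assumed)
repacks        = 50     # full repacks over the project's history (assumed)

io_per_repack_mb = repo_size_mb + change_size_mb   # everything is copied again
total_io_mb = repacks * io_per_repack_mb

print(f"new data stored: {repacks * change_size_mb} MB")
print(f"repack I/O paid: {total_io_mb / 1024:.0f} GB")   # ~100 GB for ~50 MB of changes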

Wouldn't it be better to have a maximum of N packs, named
pack_0 .. pack_(N - 1), in the repository, with each pack_i being
between 2^i and 2^(i+1)-1 bytes in size? We could even dispense
completely with loose objects and instead have each git operation
create a single new pack.
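
A minimal sketch of the size-to-bucket rule (bucket_index is just my name
for the mapping; nothing like it exists in git today):

def bucket_index(pack_size_bytes):
    """Return i such that 2**i <= pack_size_bytes <= 2**(i+1) - 1."""
    assert pack_size_bytes > 0
    return pack_size_bytes.bit_length() - 1

# A 1500-byte pack lands in pack_10 (1024..2047 bytes),
# a 3 MB pack in pack_21, a 1 GB pack in pack_30.
print(bucket_index(1500))        # 10
print(bucket_index(3 * 2**20))   # 21
print(bucket_index(2**30))       # 30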

Then the repacking rule simply becomes: if a new pack_i would
overwrite one of the same name, both packs are merged into a new pack_(i+1).
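
The cascade then works like carry propagation in a binary counter. Below is
a self-contained sketch, again with invented names, modelling a pack only by
its size and assuming the merged pack is as large as the sum of its inputs
(the worst case used in the next paragraph):

def bucket_index(size):
    # i such that 2**i <= size <= 2**(i+1) - 1
    return size.bit_length() - 1

def add_pack(buckets, new_pack_size):
    """Insert a new pack; merge whenever its bucket is already occupied.

    buckets maps i -> size of the existing pack_i.
    Returns the number of bytes copied by the cascading merges.
    """
    copied = 0
    size = new_pack_size
    i = bucket_index(size)
    while i in buckets:
        size += buckets.pop(i)   # worst case: merged size is the sum
        copied += size           # both inputs are rewritten once
        i = bucket_index(size)   # the merged pack moves at least one level up
    buckets[i] = size
    return copied

buckets = {}
copied = sum(add_pack(buckets, s) for s in [700, 900, 1200, 650, 5000])
print(buckets, copied)   # {11: 2800, 9: 650, 12: 5000} 4400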

To analyze performance, let's assume the worst case, where the size of a
pack is equal to the expanded size of all objects contained in it, and new
packs contain only unique objects. Under these assumptions, an object
residing in pack_i can only ever be merged into a pack_j with j > i, so
each byte is copied at most once per level. If a repository of size n
contains k objects, the maximum total I/O required to create the repository
(counting all operations in its history) is therefore O(n log k).

In the current situation, the number of repacks required is linear in the
number of objects, so the total work required is more like O(n k).
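
To see how far apart the two bounds are, here is a rough simulation under
the same worst-case assumptions, with k equally sized objects added one at
a time (object size and count are arbitrary):

def bucketed_io(k, obj_size=1000):
    # Bytes copied when every new object arrives as its own pack
    # and packs cascade-merge as sketched above.
    buckets, copied = {}, 0
    for _ in range(k):
        size = obj_size
        i = size.bit_length() - 1
        while i in buckets:
            size += buckets.pop(i)
            copied += size
            i = size.bit_length() - 1
        buckets[i] = size
    return copied

def full_repack_io(k, obj_size=1000):
    # Bytes copied when the whole repository is repacked after
    # every new object.
    return sum(n * obj_size for n in range(1, k + 1))

k = 10_000
print(f"bucketed:    {bucketed_io(k):>14,} bytes   (~ n log k)")
print(f"full repack: {full_repack_io(k):>14,} bytes   (~ n k)")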

While I understand that the above is a gross simplification, and that actual
performance is dictated by packing efficiency and constant factors rather
than asymptotic behavior, I think the general idea of limiting the number of
packs in this way is useful and would lead to significant speedups,
especially during large imports that currently require frequent repacking of
the entire repository.
