On Apr 22, 2009, at 16:05, Jeff King wrote:
> The other tradeoff, mentioned by Matthieu, is not about speed, but
> about rollover of files on disk. I think he would be in favor of a
> less optimal pack setup if it meant rewriting the largest packfile
> less frequently.
>
> However, it may be reasonable to suggest that he just not manually
> "gc" then. If he is not generating enough commits to warrant an
> auto-gc, then he is probably not losing much by having loose objects.
> And if he is, then auto-gc is already taking care of it.
For large repositories with lots of large files, git spends too much
time copying large packs for relatively little gain. This becomes
obvious as soon as a repository contains a few dozen large objects.

Currently, there is no limit to the number of times this data may be
copied. In particular, the average amount of I/O needed for a change
of size X depends linearly on the size of the total repository, so the
mere presence of a couple of large objects imposes a large, distributed
overhead.
Wouldn't it be better to have a maximum of N packs, named
pack_0 .. pack_(N - 1), in the repository, with each pack_i being
between 2^i and 2^(i+1)-1 bytes in size? We could even dispense
completely with loose objects and instead have each git operation
create a single new pack.

The repacking rule then simply becomes: if a new pack_i would
overwrite an existing pack of the same name, the two packs are merged
into a new pack_(i+1).
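
Roughly, the scheme behaves like a binary counter over pack sizes. The
following Python sketch is purely illustrative (it is not git code, and
slot_for()/add_pack() are made-up names): "slots" maps slot index i to
the size of pack_i, and a new pack that lands in an occupied slot is
merged with the resident pack into the next slot, cascading upward like
a carry:

def slot_for(size):
    # index i with 2^i <= size < 2^(i+1), for size >= 1
    return size.bit_length() - 1

def add_pack(slots, new_size):
    # Insert a pack of new_size bytes; return total bytes written,
    # counting the initial pack and every merged pack it triggers.
    written = new_size
    size = new_size
    while slot_for(size) in slots:
        size += slots.pop(slot_for(size))   # merge with resident pack
        written += size                     # merged pack is rewritten
    slots[slot_for(size)] = size
    return written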
To analyze performance, let's assume the worst case, where the size of
a pack equals the expanded size of all objects contained in it and new
packs contain only unique objects. Under these assumptions, an object
residing in pack_i can only ever be merged into a pack_j with j > i,
so each object is copied at most once per slot it passes through, i.e.
roughly log k times for an object of average size. Hence, if a
repository of size n contains k objects, the maximum total I/O required
to create the repository (counting all operations in its history) is
O(n log k).

In the current situation, the number of repacks required is linear in
the number of objects, so the total work required is more like O(n k).
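
To make the asymptotic difference concrete, here is a toy calculation
under the same simplifying assumptions, reusing the add_pack() sketch
above: k pushes of one unit-sized object each, with "full repack after
every object" as an extreme stand-in for any policy whose repack count
grows linearly with the number of objects. This is illustrative
arithmetic, not a measurement of git:

def full_repack_cost(k):
    # the whole repository is rewritten after every new object
    return sum(range(1, k + 1))        # ~ k^2 / 2, i.e. O(n k)

def carry_scheme_cost(k):
    # total bytes written by the pack_i merging rule
    slots, total = {}, 0
    for _ in range(k):
        total += add_pack(slots, 1)    # ~ k log k, i.e. O(n log k)
    return total

for k in (1 << 10, 1 << 14):
    print(k, full_repack_cost(k), carry_scheme_cost(k))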
While I understand that the above is a gross simplification, and actual
performance is dictated by packing efficiency and constant factors
rather than asymptotic behaviour, I think the general idea of limiting
the number of packs in the way described is useful and would lead to
significant speedups, especially during large imports that currently
require frequent repacking of the entire repository.