On Fri, Oct 05 2018, Jeff King wrote:

> On Fri, Oct 05, 2018 at 03:41:40PM -0400, Derrick Stolee wrote:
>
>> > So can we really just take (total_objects - commit_graph_objects) and
>> > compare it to some threshold?
>>
>> The commit-graph only stores the number of _commits_, not total objects.
>
> Oh, right, of course. That does throw a monkey wrench in that line of
> thought. ;)
>
> There's unfortunately not a fast way of doing that. One option would be
> to keep a counter of "ungraphed commit objects", and have callers update
> it. Anybody admitting a pack via index-pack or unpack-objects can easily
> get this information. Commands like fast-import can do likewise, and
> "git commit" obviously increments it by one.
>
> I'm not excited about adding a new global on-disk data structure (and
> the accompanying lock).

You don't really need a new global data structure to solve this problem.

It would be sufficient to have git-gc itself write out a 4-line text
file after it runs saying how many tags, commits, trees and blobs it
found on its last run.

You can then fuzzily compare object counts vs. commit counts for the
purposes of deciding whether something like the commit-graph needs to
be updated, while assuming that whatever new data you have has similar
enough type ratios to your existing data.

That's an assumption that'll hold well enough for big repos, where this
matters the most, and which tend to grow in fairly uniform ways as far
as their object type ratios go.

Databases like MySQL, PostgreSQL etc. pull similar tricks with their
fuzzy table statistics.
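
To make that concrete, here's a rough sketch of what the heuristic
could look like. The stats file name, its format, the current object
count, and the threshold are all invented for illustration; the point
is just the arithmetic of scaling object growth by the last known
commit ratio:

```shell
# Hypothetical stats file a git-gc run could leave behind
# (name and format are made up here):
stats_file=gc-object-stats
printf '%s\n' 'tags 500' 'commits 100000' 'trees 400000' 'blobs 900000' \
    >"$stats_file"

old_total=$(awk '{sum += $2} END {print sum}' "$stats_file")
old_commits=$(awk '$1 == "commits" {print $2}' "$stats_file")

# Pretend the repository now has this many objects in total (in reality
# you'd get a number from something like "git count-objects -v"):
new_total=1540000

# Estimated new commits = object growth scaled by the old commit ratio.
est_new_commits=$(( (new_total - old_total) * old_commits / old_total ))
echo "estimated new commits: $est_new_commits"

# If the estimate exceeds some threshold, the commit-graph is stale
# enough to be worth rewriting.
threshold=5000
if [ "$est_new_commits" -gt "$threshold" ]; then
    echo "would rewrite the commit-graph now"
fi
```

The estimate will be off whenever the new data's type ratios differ
from the old, but as argued above that error should be small for the
large repositories where this matters.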