Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?

> On 10 Jun 2017, at 10:06, Jeff King <peff@xxxxxxxx> wrote:
> 
> On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:
> 
>>> I agree the existing message isn't great. There should probably be a big
>>> advise() block explaining what's going on (and that expert users can
>>> disable).
>> 
>> How about this?
>> 
>> diff --git a/builtin/gc.c b/builtin/gc.c
>> index c2c61a57bb..12ee212544 100644
>> --- a/builtin/gc.c
>> +++ b/builtin/gc.c
>> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>> 	if (pack_garbage.nr > 0)
>> 		clean_pack_garbage();
>> 
>> -	if (auto_gc && too_many_loose_objects())
>> -		warning(_("There are too many unreachable loose objects; "
>> -			"run 'git prune' to remove them."));
>> +	if (auto_gc && too_many_loose_objects()) {
>> +		warning(_("Auto packing did not lead to optimal results as the "
>> +			"repository contains too many unreachable objects."));
>> +		advice(_("Unreachable objects are Git objects (commits, files, ...) "
>> +			"that are not referenced by any branch or tag. This might happen "
>> +			"if you use 'git rebase' or if you delete branches. Auto packing "
>> +			"only prunes unreachable objects that are older than 2 weeks "
>> +			"(default, overridable by the config variable 'gc.pruneExpire'). "
>> +			"Please run 'git prune' to prune all unreachable objects for "
>> +			"optimal repository performance."));
>> +	}
> 
> s/advice/advise/, of course. This should probably be protected by a
> new entry in advice_config[] in advice.c.
> 
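> Something along these lines, following the pattern already in
> advice.c (the flag and the config name "gcautoprune" are made up
> here for illustration):
> 
>   int advice_gc_auto_prune = 1;
> 
>   static struct {
>           const char *name;
>           int *preference;
>   } advice_config[] = {
>           ...
>           { "gcautoprune", &advice_gc_auto_prune },
>   };
> 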
> But I assume you are most interested in the text. I think it
> simultaneously goes into too much and too little detail. I think the
> warning itself should just say _what_ we observed: after garbage
> collection, there were still enough objects to trigger a gc.  And then
> the hint doesn't need to go into the details of why we prune or what
> unreachable objects are. Those can be cross-referenced with other
> documentation. I think we need to focus on what the warning means, and
> whether and how they would correct it.
> 
> Maybe:
> 
>  warning: too many loose objects remain after garbage collection
>  hint: Automatic garbage collection is triggered when there are a
>  hint: large number of unpacked objects in the repository. Unreachable
>  hint: objects that are more recent than gc.pruneExpire are not
>  hint: pruned. If there are too many of these recent loose
>  hint: objects, automatic garbage collection may be triggered more
>  hint: frequently than necessary. You may run "git prune" now to
>  hint: prune all unreachable objects, regardless of their age.
> 
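> The call site in builtin/gc.c might then become something like this
> (sketch only; advice_gc_auto_prune is the hypothetical flag from
> above):
> 
>   if (auto_gc && too_many_loose_objects()) {
>           warning(_("too many loose objects remain "
>                   "after garbage collection"));
>           if (advice_gc_auto_prune)
>                   /* the hint text proposed above */
>                   advise(_("Automatic garbage collection is "
>                           "triggered when ..."));
>   }
> 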
> I was tempted to suggest that we find and report the correct "prune"
> cutoff that would let us avoid auto-gc. I.e., sort the unreachable
> objects by timestamp and find the cutoff that will drop enough to leave
> fewer than `gc.auto`. That in theory makes things a bit safer. That's
> probably not a good idea, though:
> 
>  1. Telling the user to run `git prune --expire=37.minutes.ago` is
>     just going to confuse them. We could hide it behind a command line
>     option like `git prune --expire-auto-gc` or something, though.
> 
>  2. Now that we try to keep recent chunks of history, the analysis
>     isn't quite so easy. A single recent commit may reference a ton
>     of old history, and only dropping that commit would help. So the
>     analysis is harder than a simple sort-and-cutoff, but it also means
>     that the prune times are likely to skew close to "now".
> 
>  3. If we just show them how to prune the minimal amount, then they're
>     likely to just hit this message again soon.
> 
> So that's probably a dead end.
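> 
> For what it's worth, the naive version of that sort-and-cutoff
> analysis is simple enough to sketch (self-contained, illustrative
> only, and it ignores point 2 entirely):
> 
>   #include <stdlib.h>
>   #include <time.h>
> 
>   static int cmp_time(const void *a, const void *b)
>   {
>           time_t ta = *(const time_t *)a, tb = *(const time_t *)b;
>           return ta < tb ? -1 : ta > tb;
>   }
> 
>   /*
>    * Given the mtimes of all unreachable loose objects, return an
>    * --expire cutoff that leaves fewer than gc_auto of them behind
>    * (0 means no pruning is needed at all).
>    */
>   static time_t find_cutoff(time_t *mtimes, size_t nr, size_t gc_auto)
>   {
>           qsort(mtimes, nr, sizeof(*mtimes), cmp_time); /* oldest first */
>           if (nr < gc_auto)
>                   return 0;
>           /* keep the gc_auto-1 newest (ties aside); prune the rest */
>           return mtimes[nr - gc_auto + 1];
>   }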
> 
> To be honest, the fact that we have to write this warning at all is a
> sign that Git is not doing a very good job. The best place to spend
> effort would be to teach git-gc to pack all of the unreachable objects
> into a single "cruft" pack, so this problem doesn't happen at all (and
> it's way more efficient, as well).
> 
> The big problem with that approach is that we lose individual-object
> timestamps. Each object just gets the timestamp of its surrounding pack,
> so as we continually ran auto-gc, the cruft-pack timestamp would get
> updated and we'd never drop objects. So we'd need some auxiliary file
> (e.g., pack-1234abcd.times) that stores the per-object timestamps. This
> can be as small as a 32- or 64-bit int per object, since we can just
> index it by the existing object list in the pack .idx.
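> 
> Reading from such a file would be trivial. Assuming a header-less
> array of 8-byte big-endian mtimes, one per object in .idx order
> (the format here is made up):
> 
>   #include <stdint.h>
>   #include <stddef.h>
> 
>   /* mtime of the object at position idx_pos in the pack .idx */
>   static uint64_t cruft_mtime(const unsigned char *times_map,
>                               uint32_t idx_pos)
>   {
>           const unsigned char *p = times_map + 8 * (size_t)idx_pos;
>           uint64_t t = 0;
>           int i;
> 
>           for (i = 0; i < 8; i++)
>                   t = (t << 8) | p[i]; /* big-endian on disk */
>           return t;
>   }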

Why can't we generate a new cruft-pack on every gc run that detects too
many unreachable objects? That would not be as efficient as a single
cruft-pack, but it should be way more efficient than keeping the
individual loose objects, no?

Plus, chances are that the existing cruft-packs would be purged by the
next gc run anyway.

- Lars

> The trickiest part would be when an object's timestamp gets freshened
> (because somebody tried to write it again but we optimized out the
> write). Updating a timestamp in the .times file could probably be done
> atomically, though we usually avoid writing in the middle of a file (we
> certainly can't portably do so via mmap, and I can't think of another
> case where we do seeked writes). It might be sufficient for objects in
> the cruft pack to just do the actual loose object write instead of
> optimizing it out. It should happen very rarely.
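> 
> The in-place update itself would just be one small pwrite(2) into
> the slot for that object (same made-up format as above; whether a
> single 8-byte write is safe enough is exactly the open question):
> 
>   #include <stdint.h>
>   #include <unistd.h>
> 
>   /* overwrite one 8-byte mtime slot in pack-*.times in place */
>   static int freshen_cruft_mtime(int times_fd, uint32_t idx_pos,
>                                  uint64_t now)
>   {
>           unsigned char buf[8];
>           int i;
> 
>           for (i = 0; i < 8; i++)
>                   buf[i] = (now >> (8 * (7 - i))) & 0xff;
>           if (pwrite(times_fd, buf, sizeof(buf),
>                      8 * (off_t)idx_pos) != (ssize_t)sizeof(buf))
>                   return -1;
>           return 0;
>   }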




