Re: "There are too many unreachable loose objects" - why don't we run 'git prune' automatically?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Fri, Jun 09, 2017 at 02:03:18PM +0200, Lars Schneider wrote:

> > I agree the existing message isn't great. There should probably be a big
> > advise() block explaining what's going on (and that expert users can
> > disable).
> 
> How about this?
> 
> diff --git a/builtin/gc.c b/builtin/gc.c
> index c2c61a57bb..12ee212544 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -473,9 +473,18 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
>  	if (pack_garbage.nr > 0)
>  		clean_pack_garbage();
>  
> -	if (auto_gc && too_many_loose_objects())
> -		warning(_("There are too many unreachable loose objects; "
> -			"run 'git prune' to remove them."));
> +	if (auto_gc && too_many_loose_objects()) {
> +		warning(_("Auto packing did not lead to optimal results as the "
> +			"repository contains too many unreachable objects."));
> +		advice(_("Unreachable objects are Git objects (commits, files, ...) "
> +			"that are not referenced by any branch or tag. This might happen "
> +			"if you use 'git rebase' or if you delete branches. Auto packing "
> +			"only prunes unreachable objects that are older than 2 weeks "
> +			"(default, overridable by the config variable 'gc.pruneExpire'). "
> +			"Please run 'git prune' to prune all unreachable objects for "
> +			"optimal repository performance."));
> +	}

s/advice/advise/, of course. This probably be protected by a new entry
in advice_config[] in advice.c.

But I assume you are most interested in the text. I think it
simultaneously goes into too much and too little detail. I think the
warning itself should just say _what_ we observed: after garbage
collection, there were still enough objects to trigger a gc.  And then
the hint doesn't need to go into the details of why we prune or what
unreachable objects are. Those can be cross-referenced with other
documentation. I think we need to focus on what the warning means, and
whether and how they would correct it.

Maybe:

  warning: too many loose objects remain after garbage collection
  hint: Automatic garbage collection is triggered when there are a
  hint: large number of unpacked objects in the repository. Unreachable
  hint: objects that are more recent than gc.pruneExpire are not
  hint: pruned. If there are too many of these recent loose
  hint: objects, automatic garbage collection may be triggered more
  hint: frequently than necessary. You may run "git prune" now to
  hint: prune all unreachable objects, regardless of their age.

I was tempted to suggest that we find and report the correct "prune"
cutoff that would let us avoid auto-gc. I.e., sort the unreachable
objects by timestamp and find the cutoff that will drop enough to leave
fewer than `gc.auto`. That in theory makes things a bit safer. That's
probably not a good idea, though:

  1. Telling the user to run `git prune --expire=37.minutes.ago` is
     just going to confuse them. We could hide it behind a command line
     option like `git prune --expire-auto-gc` or something, though.

  2. Now that we try to keep recent chunks, the analysis isn't quite so
     easy. You may have a single recent commit that references a ton of
     old history, and only dropping that commit would help. So the
     analysis is harder than a simple sort-and-cutoff, but it also means
     that the prune times are likely to skew close to "now".

  3. If we just show them how to prune the minimal amount, then they're
     likely to just hit this message again soon.

So that's probably a dead end.

To be honest, the fact that we have to write this warning at all is a
sign that Git is not doing a very good job. The best place to spend
effort would be to teach git-gc to pack all of the unreachable objects
into a single "cruft" pack, so this problem doesn't happen at all (and
it's way more efficient, as well).

The big problem with that approach is that we lose individual-object
timestamps. Each object just gets the timestamp of its surrounding pack,
so as we continually ran auto-gc, the cruft-pack timestamp would get
updated and we'd never drop objects. So we'd need some auxiliary file
(e.g., pack-1234abcd.times) that stores the per-object timestamps. This
can be as small as a 32- or 64-bit int per object, since we can just
index it by the existing object list in the pack .idx.

The trickiest part would be when an object's timestamp gets freshened
(because somebody tried to write it again but we optimized out the
write). Updating the timestamps in the .times file would probably work
atomically, though we usually avoid writing in the middle of a file (we
certainly can't portably do so via mmap, and I can't think of another
case where we do seeked writes). It might be sufficient for objects in
the cruft pack to just do the actual loose object write instead of
optimizing it out. It should happen very rarely.

-Peff



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]