Re: [PATCH] gc: do not warn about too many loose objects

Jeff King <peff@xxxxxxxx> · Mon, 16 Jul 2018 16:29:16 -0400

On Mon, Jul 16, 2018 at 12:54:31PM -0700, Jonathan Nieder wrote:

> Even restricting attention to the warning, not the exit code, I think
> you are saying the current behavior is acceptable.  I do not believe
> it is.

I didn't say that at all. The current situation sucks, and I think the
right solution is to pack the unreachable objects instead of exploding
them.

> I think you are saying Jonathan's patch makes the behavior
> worse, and I'm not seeing it.  Can you describe an example user
> interaction where the current behavior would be better than the
> behavior after Jonathan's patch?  That might help make this discussion
> more concrete.

It makes it worse because there is nothing to throttle a "gc --auto"
that is making no progress.

Try this:

-- >8 --
#!/bin/sh

rm -rf repo

# new mostly-empty repo
git init repo
cd repo
git commit --allow-empty -m base

# make a bunch of unreachable blobs
for i in $(seq 7000); do
  echo "blob"
  echo "data <<EOD"
  echo "cruft $i"
  echo "EOD"
done |
git fast-import

# This gc will explode them (doing a "gc --auto" isn't sufficient here, because
# we don't have enough _other_ material in the repo to trigger a gc. But you
# can imagine that a normal repo would eventually "gc --auto".
git gc

# Now simulate some real work in the repo.
for i in $(seq 50); do
  git commit --allow-empty -m "commit $i"
  sleep 30
done
-- 8< --

With the current code, that produces a bunch of annoying warnings for
every commit ("I couldn't gc because the last gc reported a warning").
But after Jonathan's patch, every single commit will do a full gc of the
repository. In this tiny repository that's relatively quick, but in a
real repo it's a ton of CPU and I/O, all for nothing.

> [...]
> > See the thread I linked earlier on putting unreachable objects into a
> > pack, which I think is the real solution.
> 
> Have you looked over the discussion in "Loose objects and unreachable
> objects" in Documentation/technical/hash-function-transition.txt?  Do
> you have thoughts on it (preferrably in a separate thread)?

It seems to propose putting the unreachable objects into a pack. Which
yes, I absolutely agree with (as I thought I'd been saying in every
single email in this thread).

> > I mean that if you do not write a persistent log, then "gc --auto" will
> > do an unproductive gc every time it is invoked. I.e., it will see "oh,
> > there are too many loose objects", and then waste a bunch of CPU every
> > time you commit.
> 
> If so, then this would already be the behavior in non daemon mode.
> Are you saying this accidental effect of daemon mode is in fact
> intentional?

I'm not sure if I'd call it intentional, since I don't recall ever
seeing this side effect discussed. But since daemonizing is the default,
I think that's the case people have focused on when they hit annoying
cases. E.g., a831c06a2b (gc: ignore old gc.log files, 2017-02-10).

I'd consider the fact that "gc --auto" with gc.autoDetach=false will
repeatedly do useless work to be a minor bug. But I think Jonathan's
patch makes it even worse because we do not even tell people that
useless work is being done (while making them wait for each gc to
finish!). Even if we were going to remove this message to help the
daemonized case, I think we'd want to retain it for the non-daemon case.

> > A daemonized git-gc runs a bunch of sub-commands (e.g., "git prune")
> > with their stderr redirected into the logfile. If you want to have
> > warnings go somewhere else, then either:
> >
> >   - you need some way to tell those sub-commands to send the warnings
> >     elsewhere (i.e., _not_ stderr)
> >
> >     or
> >
> >   - you have to post-process the output they send to separate warnings
> >     from other errors. Right now that means scraping. If you are
> >     proposing a system of machine-readable output, it would need to work
> >     not just for git-gc, but for every sub-command it runs.
> 
>       or
> 
>     - you can simply record and trust the exit code
> 
> A simple way to do that without changing formats is to truncate the
> file when exiting with status 0.

That's a different behavior than what we do now (and what was suggested
earlier), in that it assumes that anything written to stderr is OK for
gc to hide from the user if the process exits with code zero.

That's probably OK in most cases, though I wonder if there are corner
cases. For example, if you have a corrupt ref, we used to say "error:
refs/heads/foo does not point to a valid object!" but otherwise ignore
it. These days git-gc sets REF_PARANOIA=1, so we'll actually barf on a
corrupt ref. But I wonder if there are other cases lurking.

-Peff