On 10/03/2014 10:39 PM, Jeff King wrote:
> [...]
> Instead, this patch pushes the extra work onto prune, which
> runs less frequently (and has to look at the whole object
> graph anyway). It creates a new category of objects: objects
> which are not recent, but which are reachable from a recent
> object. We do not prune these objects, just like the
> reachable and recent ones.
>
> This lets us avoid the recursive check above, because if we
> have an object, even if it is unreachable, we should have
> its referent:
>
>   - if we are creating new objects, then we cannot create
>     the parent object without having the child
>
>   - and if we are pruning objects, will not prune the child
>     if we are keeping the parent
>
> The big exception would be if one were to write the object
> in a way that avoided referential integrity (e.g., using
> hash-object). But if you are in the habit of doing that, you
> deserve what you get.
>
> Naively, the simplest way to implement this would be to add
> all recent objects as tips to the reachability traversal.
> However, this does not perform well. In a recently-packed
> repository, all reachable objects will also be recent, and
> therefore we have to consider each object twice (both as a
> tip, and when we reach it in the traversal). I tested this,
> and it added about 10s to a 30s prune on linux.git. This
> patch instead performs the normal reachability traversal
> first, then follows up with a second traversal for recent
> objects, skipping any that have already been marked.

I haven't read all of the old code, but if I understand correctly this
is your new algorithm:

1. Walk from all references etc., marking reachable objects.

2. Iterate over *all* objects, in no particular order, skipping the
   objects that are already known to be reachable. Use any unreachable
   object that has a recent mtime as a tip for a second traversal that
   marks all of its references as "to-keep".

3. Iterate over any objects that are not marked "to-keep". (I assume
   that this iteration is in no particular order.) For each object:

   * [Presumably] verify that its mtime is still "old"

   * If so, prune the object

I see some problems with this.

* The one that you mentioned in your cover letter, namely that prune's
  final mtime check is not atomic with the object deletion. I agree
  that this race is so much shorter than the others that we can accept
  a solution that doesn't eliminate it, so let's forget about this one.

* If the final mtime check fails, then the object is recognized as new
  and not pruned. But does that prevent its referents from being
  pruned?

  * When this situation is encountered, you would have to start
    another object traversal starting at the "renewed" object to mark
    its referents "to-keep". I don't see that you do this. Another,
    much less attractive alternative would be to abort the prune
    operation if this situation arises.

  But even if you do one of these...

* ...assuming the iteration in step 3 is in no defined order, a
  referent might *already* have been pruned before you notice the
  "renewed" object.

So although your changes are a big improvement, it seems to me that
they still leave a race with a window approximately as long as the
time it takes to scan and prune the unreachable objects.

I think that the only way to get rid of that race is to delete objects
in parent-to-child order; in other words, *only prune an object after
all objects that refer to it have been pruned*. This could be done by
maintaining reference counts of the to-be-pruned objects and only
deleting an object once its reference count is zero.
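In a toy model (Python pseudocode, nothing like git's actual code;
refs_of, is_fresh, and delete are made-up stand-ins for reading an
object's referents, doing the final mtime re-check, and unlinking the
loose object), the ordering I mean would look roughly like this:

    from collections import deque

    def prune_in_order(candidates, refs_of, is_fresh, delete):
        """Delete prune candidates in parent-to-child order."""
        candidates = set(candidates)

        # refcount[x] = number of candidates that still refer to x.
        refcount = dict.fromkeys(candidates, 0)
        for obj in candidates:
            for child in refs_of(obj):
                if child in candidates:
                    refcount[child] += 1

        # Start with candidates that no other candidate refers to.
        queue = deque(obj for obj in candidates if refcount[obj] == 0)
        while queue:
            obj = queue.popleft()
            if is_fresh(obj):
                # The object was renewed while we were working: keep
                # it, and do *not* release its referents, so that they
                # survive as well.
                continue
            delete(obj)
            for child in refs_of(obj):
                if child in candidates:
                    refcount[child] -= 1
                    if refcount[child] == 0:
                        queue.append(child)

Note that if the mtime re-check keeps a "renewed" object, its
referents never reach a reference count of zero, so they are
automatically kept as well.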
The next point that I'm confused by is what happens when a new object
or reference is created while prune is running, and the new object or
reference refers to old objects. When we discussed this privately, I
think I claimed that the following freshenings would not be necessary,
but now I think that they are (sorry about that!).

Let's take the simpler case first. Suppose I run the following command
between steps 1 and 3:

    git update-ref refs/heads/newbranch $COMMIT

where $COMMIT is a previously-unreachable object. This doesn't affect
the mtime of $COMMIT, does it? So how does prune know that it
shouldn't delete $COMMIT?

-> So ISTM that updating a reference (or any other traversal starting
   point, like the index) must freshen the mtime of any object newly
   referred to.

A more complicated case: suppose I create a new $COMMIT referring to
an old $TREE during step 2, *after* prune has scanned the directory
that now contains $COMMIT. (I.e., the scan in step 2 never notices
$COMMIT.) Then I create a new reference pointing at $COMMIT. (I.e.,
the scan in step 1 never noticed that the reference exists.) None of
this affects the mtime of $TREE, does it? So how does prune know that
it mustn't prune $TREE?

-> It seems to me that the creation of $COMMIT has to freshen the
   mtime of $TREE, so that the final mtime check in step 3 realizes
   that it shouldn't prune $TREE.

Or to generalize, whenever a new object is created which refers to
existing objects, the direct referents of the new object have to have
their mtimes freshened. However, when an attempt to write a new object
accidentally coincides with an object that already exists, *that*
object needs to be freshened but *its* referents do *not*. (A toy
sketch of this rule is appended at the end of this mail.)

I hope I understood that all correctly...

Michael

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
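The freshening rule above, as a toy sketch (Python pseudocode, not a
proposed implementation; path_of, exists, and do_write are made-up
placeholders for locating a loose object, checking whether we already
have it, and actually writing it):

    import os

    def freshen(path):
        os.utime(path, None)  # bump the loose object's mtime to "now"

    def write_object(sha1, data, referents, path_of, exists, do_write):
        if exists(sha1):
            # The "new" object coincides with one we already have:
            # freshen *that* object, but leave its referents alone.
            freshen(path_of(sha1))
            return
        do_write(sha1, data)
        # A genuinely new object: freshen its direct referents, so the
        # final mtime check in step 3 sees e.g. $TREE as recent even
        # though prune never scanned $COMMIT itself.
        for ref in referents:
            freshen(path_of(ref))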