Re: [PATCH 13/16] prune: keep objects reachable from recent objects

Jeff King <peff@xxxxxxxx> · Fri, 3 Oct 2014 20:09:02 -0400

On Fri, Oct 03, 2014 at 02:47:57PM -0700, Junio C Hamano wrote:

> Sorry but this part is beyond a simple panda brain.

I probably didn't explain it very well. I found it rather confusing to
reason about. Let's see if we can go through it together.

> I can understand this
> 
> 	If we have an object, even if it is unreachable, we
> 	should have its referent.
>         
> as a description of the desired behaviour.  If we have a tree that
> is unreachable, we must make sure that we do not discard blobs that
> are reachable from that tree, or we would end up corrupting our
> repository if we ever allow that tree to become reachable from our
> refs later.

Yes, exactly.

> >   - if we are creating new objects, then we cannot create
> >     the parent object without having the child
> 
> We cannot create the parent (e.g. "tree") without having the child
> (e.g. "blob that is referred to by the tree we are creating").
> So this bullet point is repeating the same thing?

Sort of. The first statement "if we have an object, we should have its
referent" is a desired state. The bullet points are trying to reason
about the _changes_ in state, and argue that there are no state changes
that take us from a valid state to an invalid one.

There are two actions that affect the state of the object database:
adding objects and removing objects.

We cannot go from a valid state to an invalid state by adding objects,
because we cannot create the parent without having the child. That is
already the case before this patch (though as noted, you can "cheat"
if you know the sha1 from another repository and bypass git's safety
checks, but I do not think that is a case worth worrying about).

> >   - and if we are pruning objects, will not prune the child
> >     if we are keeping the parent
> 
> We will not prune "blob" that are reachable from a "tree" that we
> are not yet ready to prune.  So this again is repeating the same
> thing?

This is the other action. When removing objects, we will keep the child
along with the parent, and therefore we cannot move to an invalid state.
That's the part that this patch does.

> With this patch applied, the system will not prune unreachable old
> objects that are reachable from a recent object (the recent object
> itself may or may not be reachable but that does not make any
> difference).  And that is sufficient to ensure the integrity of the
> repository even if you allow new objects to be created reusing any
> of these unreachable objects that are left behind by prune, because
> the reachability check done during prune (with this patch applied)
> makes sure any object left in the repository can safely be used as a
> starting point of connectivity traversal.

Yes, exactly.

> Ok, I think I got it now, but then do we still need to utime(2) the
> loose object files for unreachable objects that are referenced by
> a recent object (which is done in a later patch), or is that purely
> an optimization for the next round of gc where you would have more
> recent objects (i.e. you do not have to traverse to find out an old
> one is reachable from a new one, as there will be fewer old ones)?

No, we don't need to utime() the loose objects. As long as there is a
recently-written object that refers to them, they are considered worth
keeping.

The later patch that calls utime() on objects is really about defeating
the write_sha1_file optimization. That is, you might _think_ you have
written a recent object that refers to other objects, but the sha1-file
code silently turned it into a noop.

Does that make more sense?

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html