On 10/03/2014 10:39 PM, Jeff King wrote:
> [...]
> Instead, this patch pushes the extra work onto prune, which
> runs less frequently (and has to look at the whole object
> graph anyway). It creates a new category of objects: objects
> which are not recent, but which are reachable from a recent
> object. We do not prune these objects, just like the
> reachable and recent ones.
>
> This lets us avoid the recursive check above, because if we
> have an object, even if it is unreachable, we should have
> its referent:
>
>   - if we are creating new objects, then we cannot create
>     the parent object without having the child
>
>   - and if we are pruning objects, will not prune the child
>     if we are keeping the parent
>
> The big exception would be if one were to write the object
> in a way that avoided referential integrity (e.g., using
> hash-object). But if you are in the habit of doing that, you
> deserve what you get.
>
> Naively, the simplest way to implement this would be to add
> all recent objects as tips to the reachability traversal.
> However, this does not perform well. In a recently-packed
> repository, all reachable objects will also be recent, and
> therefore we have to consider each object twice (both as a
> tip, and when we reach it in the traversal). I tested this,
> and it added about 10s to a 30s prune on linux.git. This
> patch instead performs the normal reachability traversal
> first, then follows up with a second traversal for recent
> objects, skipping any that have already been marked.

I haven't read all of the old code, but if I understand correctly this
is your new algorithm:

1. Walk from all references etc., marking reachable objects.

2. Iterate over *all* objects, in no particular order, skipping the
   objects that are already known to be reachable. Use any unreachable
   object that has a recent mtime as a tip for a second traversal that
   marks all of its references as "to-keep".

3. Iterate over any objects that are not marked "to-keep". (I assume
   that this iteration is in no particular order.) For each object:

   * [Presumably] verify that its mtime is still "old"

   * If so, prune the object

I see some problems with this.

* The one that you mentioned in your cover letter, namely that prune's
  final mtime check is not atomic with the object deletion. I agree
  that this race is so much shorter than the others that we can accept
  a solution that doesn't eliminate it, so let's forget about this one.

* If the final mtime check fails, then the object is recognized as new
  and not pruned. But does that prevent its referents from being
  pruned?

  * When this situation is encountered, you would have to start
    another object traversal starting at the "renewed" object to mark
    its referents "to-keep". I don't see that you do this. Another,
    much less attractive alternative would be to abort the prune
    operation if this situation arises.

  But even if you do one of these...

* ...assuming the iteration in step 3 is in no defined order, a
  referent might *already* have been pruned before you notice the
  "renewed" object.

So although your changes are a big improvement, it seems to me that
they still leave a race with a window approximately as long as the
time it takes to scan and prune the unreachable objects.

I think that the only way to get rid of that race is to delete objects
in parent-to-child order; in other words, *only prune an object after
all objects that refer to it have been pruned*. This could be done by
maintaining reference counts of the to-be-pruned objects and only
deleting an object once its reference count is zero.
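In a toy model (Python pseudocode, nothing like git's actual code;
refs_of, is_fresh, and delete are made-up stand-ins for reading an
object's referents, doing the final mtime re-check, and unlinking the
loose object), the ordering I mean would look roughly like this:

    from collections import deque

    def prune_in_order(candidates, refs_of, is_fresh, delete):
        """Delete prune candidates in parent-to-child order."""
        candidates = set(candidates)

        # refcount[x] = number of candidates that still refer to x.
        refcount = dict.fromkeys(candidates, 0)
        for obj in candidates:
            for child in refs_of(obj):
                if child in candidates:
                    refcount[child] += 1

        # Start with candidates that no other candidate refers to.
        queue = deque(obj for obj in candidates if refcount[obj] == 0)
        while queue:
            obj = queue.popleft()
            if is_fresh(obj):
                # The object was renewed while we were working: keep
                # it, and do *not* release its referents, so that they
                # survive as well.
                continue
            delete(obj)
            for child in refs_of(obj):
                if child in candidates:
                    refcount[child] -= 1
                    if refcount[child] == 0:
                        queue.append(child)

Note that if the mtime re-check keeps a "renewed" object, its
referents never reach a reference count of zero, so they are
automatically kept as well.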
The next point that I'm confused by is what happens when a new object
or reference is created while prune is running, and the new object or
reference refers to old objects. When we discussed this privately, I
think I claimed that the following freshenings would not be necessary,
but now I think that they are (sorry about that!).

Let's take the simpler case first. Suppose I run the following command
between steps 1 and 3:

    git update-ref refs/heads/newbranch $COMMIT

where $COMMIT is a previously-unreachable object. This doesn't affect
the mtime of $COMMIT, does it? So how does prune know that it
shouldn't delete $COMMIT?

-> So ISTM that updating a reference (or any other traversal starting
   point, like the index) must freshen the mtime of any object newly
   referred to.

A more complicated case: suppose I create a new $COMMIT referring to
an old $TREE during step 2, *after* prune has scanned the directory
that now contains $COMMIT. (I.e., the scan in step 2 never notices
$COMMIT.) Then I create a new reference pointing at $COMMIT. (I.e.,
the scan in step 1 never noticed that the reference exists.) None of
this affects the mtime of $TREE, does it? So how does prune know that
it mustn't prune $TREE?

-> It seems to me that the creation of $COMMIT has to freshen the
   mtime of $TREE, so that the final mtime check in step 3 realizes
   that it shouldn't prune $TREE.

Or to generalize, whenever a new object is created which refers to
existing objects, the direct referents of the new object have to have
their mtimes freshened. However, when an attempt to write a new object
accidentally coincides with an object that already exists, *that*
object needs to be freshened but *its* referents do *not*. (A toy
sketch of this rule is appended at the end of this mail.)

I hope I understood that all correctly...

Michael

--
Michael Haggerty
mhagger@xxxxxxxxxxxx
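The freshening rule above, as a toy sketch (Python pseudocode, not a
proposed implementation; path_of, exists, and do_write are made-up
placeholders for locating a loose object, checking whether we already
have it, and actually writing it):

    import os

    def freshen(path):
        os.utime(path, None)  # bump the loose object's mtime to "now"

    def write_object(sha1, data, referents, path_of, exists, do_write):
        if exists(sha1):
            # The "new" object coincides with one we already have:
            # freshen *that* object, but leave its referents alone.
            freshen(path_of(sha1))
            return
        do_write(sha1, data)
        # A genuinely new object: freshen its direct referents, so the
        # final mtime check in step 3 sees e.g. $TREE as recent even
        # though prune never scanned $COMMIT itself.
        for ref in referents:
            freshen(path_of(ref))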