On Mon, Mar 07, 2022 at 10:03:35AM -0800, Jonathan Nieder wrote: > Sorry for the very slow review! I've mentioned a few times that this > overlaps in interesting ways with the gc mechanism described in > hash-function-transition.txt, so I'd like to compare and see how they > interact. Sorry for my equally-slow reply ;). I was on vacation last week and wasn't following the list closely. > > +Unreachable objects aren't removed immediately, since doing so could race with > > +an incoming push which may reference an object which is about to be deleted. > > +Instead, those unreachable objects are stored as loose object and stay that way > > +until they are older than the expiration window, at which point they are removed > > +by linkgit:git-prune[1]. > > + > > +Git must store these unreachable objects loose in order to keep track of their > > +per-object mtimes. > > It's worth noting that this behavior is already racy. That is because > when an unreachable object becomes newly reachable, we do not update > its mtime and the mtimes of every object reachable from it, so if it > then becomes transiently unreachable again then it can be wrongly > collected. Just to be clear, the race here only happens if the object in question becomes reachable _after_ a pruning GC determines its mtime. If that's the case, then the object will indeed be wrongly collected. This is consistent with the existing behavior (which is racy in the exact same way). (After re-reading what you wrote and my response, I think we are saying the exact same thing, but it doesn't hurt to think aloud). > > + > > +== Cruft packs > > + > > +A cruft pack eliminates the need for storing unreachable objects in a loose > > +state by including the per-object mtimes in a separate file alongside a single > > +pack containing all loose objects. > > Can this doc say a little about how "git prune" handles these files? > In particular, does a non cruft pack aware copy of Git (or JGit, > libgit2, etc) do the right thing or does it fight with this mechanism? > If the latter, do we have a repository extension (extensions.*) to > prevent that? I mentioned this in much more detail in [1], but the answer is that the cruft pack looks like any other pack, it just happens to have another metadata file (the .mtimes one) attached to it. So other implementations of Git should treat it as they would any other pack. Like I mentioned in [1], cruft packs were designed with the explicit goal of not requiring a repository extension. > > + 3. Write the pack out, along with a `.mtimes` file that records the per-object > > + timestamps. > > As a point of comparison, the design in hash-function-transition uses > a single timestamp for the whole pack. During read operations, objects > in a cruft pack are considered present; during writes, they are > considered _not present_ so that if we want to make a cruft object > newly present then we put a copy of it in a new pack. > > Advantage of the mtimes file approach: > - less duplication of storage: a revived object is only stored once, > in a cruft pack, and then the next gc can "graduate" it out of the > cruft pack and shrink the cruft pack > - less affect on non-gc Git code: writes don't need to know that any > cruft objects referenced need to be copied into a new pack > > Advantages of the mtime per cruft pack approach: > - easy expiration: once a cruft pack has reached its expiration date, > it can be deleted as a whole > - less I/O churn: a cruft pack stays as-is until combined into another > cruft pack or deleted. There is no frequently-modified mtimes file > associated to it > - informs the storage layer about what is likely to be accessed: cruft > packs can get filesystem attributes to put them in less-optimized > storage since they are likely to be less frequently read > > [...] The key advantage of cruft packs is that you can expire unreachable objects in piecemeal while still retaining the benefit of being able to de-duplicate cruft objects and store them packed against each other. > > +Notable alternatives to this design include: > > This doesn't mention the approach described in > hash-function-transition.txt (and that's already implemented and has > been in use for many years in JGit's DfsRepository). Does that mean > you aren't aware of it? Implementing the UNREACHABLE_GARBAGE concept from hash-function-transition.txt in cruft pack-terms would be equivalent to not writing the mtimes file at all. This follows from the fact that a pre-cruft packs implementation of Git considers a packed object's mtime to be the same as the pack it's contained in. (I'm deliberately avoiding any details from the h-f-t document regarding re-writing objects contained in a garbage pack here, since this is separate from the pack structure itself (and could easily be implemented on top of cruft packs)). So I'm not sure what the alternative we'd list would be, since it removes the key feature of the design of cruft packs. Thanks, Taylor [1]: https://lore.kernel.org/git/YiZMhuI%2FDdpvQ%2FED@nand.local/