On Tue, Mar 22, 2022 at 04:04:53PM -0700, Jonathan Nieder wrote: > Hi, > > Taylor Blau wrote: > > On Tue, Mar 22, 2022 at 02:45:16PM -0700, Jonathan Nieder wrote: > > >> Great. Can the doc cover this? I think it would be helpful to make > >> that easy to find for others with similar questions. > > > > I believe the doc covers this already, see the paragraph beginning with > > "Unreachable objects aren't removed immediately...". > > Thanks. I just reread that section and it didn't say anything obvious > about the race that continues to exist and whether cruft packs address > it. Yeah, there isn't an explicit "and cruft packs addresses the loose object explosion but does not address the race" sentence. I'm not opposed to adding something like that to clarify (though TBH, I would rather do it as a clean-up on top rather than send out a bazillion mostly-unchanged patches). > [...] > >> Sorry, the above seems like it's answering a different question than I > >> asked. The doc in Documentation/technical/ seems like a natural place > >> to describe what semantics the new .mtimes file has, and I didn't find > >> that there. Is there a different piece of documentation I should have > >> been looking at? > > > > Are you looking for a technical description of the mtimes file? If so, > > there is a section in Documentation/technical/pack-format.txt (added in > > "pack-mtimes: support reading .mtimes files") that explains this. > > I see --- is the idea that cruft-packs.txt means to refer to > pack-format.txt for the details, and cruft-packs is an overview of some > other, non-detail aspects? > > I just checked pack-format.txt and it didn't describe the semantics > (what a Git implementation is expected to do when it sees an mtimes > file). For example, in Documentation/technical/cruft-packs.txt, the > kind of thing I'd expect to see is Right; I have always considered the files in Documentation/technical to primarily be about the file format itself. > - what does an mtime value in the mtimes file represent? When is it > meant to be updated? > - what guarantees are present about when an object is safe to be > pruned? The cruft-packs.txt document covers these, though I think somewhat implicitly. Again, I'm not opposed to more clarification, but I again would like to do so on top. I think many of these are discussed within the threads above, but to answer your questions in order: - The mtime of an object in a cruft pack represents the last time that object was known to be reachable, and it's updated when generating a cruft pack or pruning. - The same guarantees are made in the cruft pack case as in the non-cruft case (i.e., "none", and so a grace period is recommended). > [...] > >> Can you tell me a little more about why we would want _not_ to have a > >> repository format extension? To me, it seems like a fairly simple > >> addition that would drastically reduce the cognitive overload for > >> people considering making use of this feature. > > > >There is no reason to prevent a pre-cruft packs version of Git from > > reading/writing a repository that uses cruft packs, since the two > > versions will still function as normal. Since there's no need to prevent > > the old version from interacting with a repository that has cruft packs, > > we wouldn't want to enforce an unnecessary boundary with an extension. > > Does "function as normal" include in repository maintenance operations > like "git maintenance", "git gc", and "git prune"? If so, this seems > like something very useful to describe in the cruft-packs.txt > document, since what happens when we bounce back and forth between old > and new versions of Git operating on the same NFS mounted repository > would not be obvious without such a discussion. Yes, all of those commands will simply ignore the .mtimes file and treat the unreachable objects as normal (where "normal" means in the exact same way as they currently do without cruft packs). I think adding a section that summarizes our discussion would be useful. > I'm still interested in the _downsides_ of using a repository format > extension. "There is no reason" is not a downside, unless you mean > that it requires adding a line of code. :) The main downside I can > imagine is that it prevents accessing the repository _that has enabled > this feature_ with an older version of Git, but I (perhaps due to a > failure of imagination) haven't put two and two together yet about > when I would want to do so. Sorry for not being clear; I meant: "There is no reason [to prohibit two versions of Git from interacting with each other when they are compatible to do so]". > [...] > > Expiring objects in piecemeal is somewhat interesting, but I think I was > > reaching a little too far when I said it was the "key benefit". It does > > have some nice properties, like being able to store cruft objects as > > deltas against other cruft objects which might get pruned at a different > > time (though, of course, you'll need to re-delta them in the case you do > > prune an object which is the base of another cruft object). > > > > But the issue with having multiple cruft packs is that the semantics get > > significantly more complicated. E.g., if you have an object represented > > in multiple cruft packs, which mtime do you use? If you want to prune > > it, you suddenly may have many packs you need to update and keep track > > of. > > Thanks for this explanation. In hash-function-transition.txt, I see > > "git gc" currently expels any unreachable objects it encounters in > pack files to loose objects in an attempt to prevent a race when > pruning them (in case another process is simultaneously writing a new > object that refers to the about-to-be-deleted object). This leads to > an explosion in the number of loose objects present and disk space > usage due to the objects in delta form being replaced with independent > loose objects. Worse, the race is still present for loose objects. > > Instead, "git gc" will need to move unreachable objects to a new > packfile marked as UNREACHABLE_GARBAGE (using the PSRC field; see > below). To avoid the race when writing new objects referring to an > about-to-be-deleted object, code paths that write new objects will > need to copy any objects from UNREACHABLE_GARBAGE packs that they > refer to new, non-UNREACHABLE_GARBAGE packs (or loose objects). > UNREACHABLE_GARBAGE are then safe to delete if their creation time (as > indicated by the file's mtime) is long enough ago. > > To avoid a proliferation of UNREACHABLE_GARBAGE packs, they can be > combined under certain circumstances. [etc] > > So the proposal there is that the file mtime for an UNREACHABLE_GARBAGE > pack refers to when that pack was written and governs when that pack > can be deleted. If an object is present in multiple packs, then newer > packs with the object have a newer mtime and thus cause the object to > be kept around for longer. That matches my understanding. Thanks, Taylor