Hi,

Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

>> One possibility to conceptually have the same thing without the overhead
>> of the list is to put the obtained-from-elsewhere objects into its own
>> alternate object store, so that we can distinguish the two.
>
> Now you are talking. Either a separate object store, or a packfile
> that is specially marked as such, would work.

Jonathan's not in today, so let me say a few more words about this
approach.

This approach implies a relaxed connectivity guarantee, by creating
two classes of objects:

 1. Objects that I made should satisfy the connectivity check. They
    can point to other objects I made, objects I fetched, or (*)
    objects pointed to directly by objects I fetched. More on (*)
    below.

 2. Objects that I fetched do not need to satisfy a connectivity
    check. I can trust the server to provide objects that they point
    to when I ask for them, except in extraordinary cases like a
    credit card number that was accidentally pushed to the server and
    prompted a rewriting of history to remove it (**).

The guarantee (1) looks like it should be easy to satisfy (just like
the current connectivity guarantee where all objects are in class
(1)). I have to know about an object to point to it --- that means
the pointed-to object has to be in the object store or pointed to by
something in the object store.

The complication is in the "git gc" operation for the case (*).
Today, "git gc" uses a reachability walk to decide which objects to
remove --- an object referenced by no other object is fair game to
remove. With (*), there is another kind of object that must not be
removed: if an object that I made, M, points to a missing/promised
object, O, pointed to by an object I fetched, F, then I cannot prune
F unless there is another fetched object present to anchor O.

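To make the extra "git gc" rule concrete, here is a rough Python
sketch. It is only an illustration under simplifying assumptions, not
how Git implements anything: the Obj class, the "fetched" flag, the
objects_to_keep function and the object names C, F, M and O are all
hypothetical, and the object store is modeled as a plain dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Obj:
    oid: str
    fetched: bool                      # True: class (2); False: class (1)
    refs: list = field(default_factory=list)


def reachable(store, tips):
    """Ordinary reachability walk; missing objects are simply skipped."""
    seen, stack = set(), list(tips)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in store:
            continue
        seen.add(oid)
        stack.extend(store[oid].refs)
    return seen


def objects_to_keep(store, tips):
    keep = reachable(store, tips)
    # Promised objects: missing objects that a kept, locally made object
    # points to (the case marked (*)).
    promised = {
        ref
        for oid in keep
        if not store[oid].fetched
        for ref in store[oid].refs
        if ref not in store
    }
    for missing in sorted(promised):
        # Fetched objects that point directly to the missing object.
        anchors = sorted(
            o.oid for o in store.values() if o.fetched and missing in o.refs
        )
        # Keep one anchor unless one already survives the walk. An empty
        # anchor list would be a connectivity error; this sketch ignores it.
        if anchors and not any(a in keep for a in anchors):
            keep.add(anchors[0])
    return keep


# Hypothetical store: O is missing, F is the only fetched object that
# promises it, and C is a fetched object that nothing local refers to.
store = {
    "C": Obj("C", fetched=True, refs=["F"]),
    "F": Obj("F", fetched=True, refs=["O"]),
    "M": Obj("M", fetched=False, refs=["O"]),
}
keep = objects_to_keep(store, tips=["M"])
```

In this toy store the plain walk from M would keep only M. The extra
pass also keeps F, because F is the only present object that anchors
the promised object O, while C can still be pruned.
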
For example: suppose I have a sparse checkout and run

    git fetch origin refs/pulls/x
    git checkout -b topic FETCH_HEAD
    echo "Some great modification" >> README
    git add README
    git commit --amend

When I run "git gc", there is nothing pointing to the commit that was
pointed to by the remote ref refs/pulls/x, so it can be pruned. I
would naively also expect that the tree pointed to by that commit
could be pruned. But pruning it means pruning the promise that made
it permissible to lack various blobs that my topic branch refers to
that are outside the sparse checkout area. So "git gc" must notice
that it is not safe to prune that tree.

This feels hacky. I prefer the promised object list over this
approach.

>                                                "Maintaining a list
> of object names in a flat file is too costly" is not a valid excuse
> to discard the integrity of locally created objects, without which
> Git will no longer be a version control system,

I am confused by this: I think that Git without a "git fsck" command
at all would still be a version control system, just not as good of
one.

Can you spell this out more? To be clear, are you speaking as a
reviewer or as the project maintainer? In other words, if other
reviewers are able to settle on a design that involves a relaxed
guarantee for fsck in this mode that they can agree on, does this
represent a veto meaning the patch can still not go through?

On one hand I'm grateful for the help exploring the design space, and
I think it has helped get a better understanding of the issues
involved. On the other hand, if this is a veto then it feels very
black and white and a hard starting point to build a consensus from.
I am worried.

[...]
>>                                                      I mentioned
>> this in my e-mail but rejected it, but after some more thought, this
>> might be sufficient - we might still need to iterate through every
>> object to know exactly what we can assume the remote to have, but the
>> "frontier" solution also needs this iteration, so we are no worse off.
>
> Most importantly, this is allowed to be costly---we are doing this
> not at runtime all the time, but when the user says "make sure that
> I haven't lost objects and it is safe for me to build further on
> what I created locally so far" by running "git fsck".

check_everything_connected is also used in some other circumstances:
e.g. when running a fetch, and when receiving a push in git
receive-pack.

Thanks,
Jonathan