Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

Jonathan Nieder <jrnieder@xxxxxxxxx> · Wed, 2 Aug 2017 10:38:57 -0700

Hi,

Junio C Hamano wrote:
> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:

>> One possibility to conceptually have the same thing without the overhead
>> of the list is to put the obtained-from-elsewhere objects into its own
>> alternate object store, so that we can distinguish the two.
>
> Now you are talking.  Either a separate object store, or a packfile
> that is specially marked as such, would work.

Jonathan's not in today, so let me say a few more words about this
approach.

This approach implies a relaxed connectivity guarantee, by creating
two classes of objects:

 1. Objects that I made should satisfy the connectivity check.  They
    can point to other objects I made, objects I fetched, or (*) objects
    pointed to directly by objects I fetched.  More on (*) below.

 2. Objects that I fetched do not need to satisfy a connectivity
    check.  I can trust the server to provide objects that they point
    to when I ask for them, except in extraordinary cases like a
    credit card number that was accidentally pushed to the server and
    prompted a rewriting of history to remove it (**).

The guarantee (1) looks like it should be easy to satisfy (just like
the current connectivity guarantee where all objects are in class (1)).
I have to know about an object to point to it --- that means the
pointed-to object has to be in the object store or pointed to by
something in the object store.
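
To make guarantee (1) concrete, here is a rough sketch in Python of what
the relaxed check could look like.  This is only a toy model under my
own assumptions, not how fsck or check_everything_connected is actually
written: the names Obj, promised and broken_links are made up for
illustration, and the object store is just a dict from object name to
(origin, refs).

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Obj(NamedTuple):
    origin: str      # "made" locally or "fetched" from the server (hypothetical tag)
    refs: List[str]  # names of the objects this object points to


def promised(store: Dict[str, Obj]) -> Set[str]:
    """Names pointed to directly by fetched objects: the (*) case."""
    return {ref
            for obj in store.values() if obj.origin == "fetched"
            for ref in obj.refs}


def broken_links(store: Dict[str, Obj]) -> List[Tuple[str, str]]:
    """Return (object, missing target) pairs that violate guarantee (1).

    Only objects I made are checked.  A target is acceptable if it is
    present in the store or is pointed to directly by a fetched object.
    """
    acceptable = set(store) | promised(store)
    return sorted((name, ref)
                  for name, obj in store.items() if obj.origin == "made"
                  for ref in obj.refs if ref not in acceptable)


# F was fetched and promises O; M is mine and may point to O even though
# O is absent.  X is mine and points to Z, which nothing vouches for.
store = {
    "F": Obj("fetched", ["O"]),
    "M": Obj("made", ["O", "F"]),
    "X": Obj("made", ["Z"]),
}
print(broken_links(store))  # [('X', 'Z')]
```

Fetched objects are never checked here, which is the class (2)
relaxation.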

The complication is in the "git gc" operation for the case (*).
Today, "git gc" uses a reachability walk to decide which objects to
remove --- an object referenced by no other object is fair game to
remove.  With (*), there is another kind of object that must not be
removed: if an object that I made, M, points to a missing/promised
object, O, pointed to by an object I fetched, F, then I cannot prune
F unless there is another fetched object present to anchor O.

For example: suppose I have a sparse checkout and run

	git fetch origin refs/pulls/x
	git checkout -b topic FETCH_HEAD
	echo "Some great modification" >> README
	git add README
	git commit --amend

When I run "git gc", there is nothing pointing to the commit that was
pointed to by the remote ref refs/pulls/x, so it can be pruned.  I
would naively also expect that the tree pointed to by that commit
could be pruned.  But pruning it means pruning the promise that made
it permissible to lack various blobs that my topic branch refers to
that are outside the sparse checkout area.  So "git gc" must notice
that it is not safe to prune that tree.
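
The extra rule "git gc" would need can be sketched as follows, again
in Python and again as a toy model rather than anything resembling the
real prune code.  Obj, reachable and prunable are hypothetical names,
and the object names in the example (C0, T0, B, and so on) are
stand-ins for the old commit, its tree, a blob outside the sparse
checkout, etc.

```python
from typing import Dict, Iterable, List, NamedTuple, Set


class Obj(NamedTuple):
    origin: str      # "made" locally or "fetched" from the server (hypothetical tag)
    refs: List[str]  # names of the objects this object points to


def reachable(store: Dict[str, Obj], roots: Iterable[str]) -> Set[str]:
    """Ordinary reachability walk over the objects that are present."""
    seen: Set[str] = set()
    todo = [r for r in roots if r in store]
    while todo:
        name = todo.pop()
        if name in seen:
            continue
        seen.add(name)
        todo.extend(r for r in store[name].refs if r in store)
    return seen


def prunable(store: Dict[str, Obj], roots: Iterable[str]) -> Set[str]:
    """Objects gc may remove without dropping a promise I still rely on."""
    live = reachable(store, roots)
    # Missing objects that live objects I made point to.
    needed = {r for n in live if store[n].origin == "made"
              for r in store[n].refs if r not in store}
    # Promises already anchored by a fetched object that survives anyway.
    anchored = {r for n in live if store[n].origin == "fetched"
                for r in store[n].refs}
    keep = set(live)
    for name, obj in store.items():
        if (name not in live and obj.origin == "fetched"
                and any(r in needed and r not in anchored for r in obj.refs)):
            keep.add(name)  # conservatively keep every remaining anchor
    return set(store) - keep


# The amend example: C0/T0/R0 came from refs/pulls/x, C1/T1/R1 are mine.
# B is a blob outside the sparse checkout and is not in the store.
store = {
    "C0": Obj("fetched", ["T0"]),
    "T0": Obj("fetched", ["R0", "B"]),
    "R0": Obj("fetched", []),
    "C1": Obj("made", ["T1"]),
    "T1": Obj("made", ["R1", "B"]),
    "R1": Obj("made", []),
}
print(sorted(prunable(store, ["C1"])))  # ['C0', 'R0']
```

The old commit C0 goes away, but its tree T0 has to stay because it is
the only thing vouching for B.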

This feels hacky.  I prefer the promised object list over this
approach.

>                                                "Maintaining a list
> of object names in a flat file is too costly" is not a valid excuse
> to discard the integrity of locally created objects, without which
> Git will no longer be a version control system,

I am confused by this: I think that Git without a "git fsck" command
at all would still be a version control system, just not as good a
one.

Can you spell this out more?  To be clear, are you speaking as a
reviewer or as the project maintainer?  In other words, if other
reviewers are able to agree on a design that involves a relaxed
guarantee for fsck in this mode, would this still be a veto, meaning
the patch cannot go through?

On one hand I'm grateful for the help exploring the design space, and
I think it has helped get a better understanding of the issues
involved.

On the other hand, if this is a veto then it feels very black and
white and a hard starting point to build a consensus from.  I am
worried.

[...]
>> I mentioned
>> this in my e-mail but rejected it, but after some more thought, this
>> might be sufficient - we might still need to iterate through every
>> object to know exactly what we can assume the remote to have, but the
>> "frontier" solution also needs this iteration, so we are no worse off.
>
> Most importantly, this is allowed to be costly---we are doing this
> not at runtime all the time, but when the user says "make sure that
> I haven't lost objects and it is safe for me to build further on
> what I created locally so far" by running "git fsck".

check_everything_connected is also used in some other circumstances:
e.g. when running a fetch, and when receiving a push in git
receive-pack.

Thanks,
Jonathan