Re: [PATCH v2 0/5] Fsck for lazy objects, and (now) actual invocation of loader

Jonathan Tan <jonathantanmy@xxxxxxxxxx> · Mon, 31 Jul 2017 16:05:33 -0700

On Mon, 31 Jul 2017 14:21:56 -0700
Junio C Hamano <gitster@xxxxxxxxx> wrote:

> Jonathan Tan <jonathantanmy@xxxxxxxxxx> writes:
> 
> > Besides review changes, this patch set now includes my rewritten
> > lazy-loading sha1_file patch, so you can now do this (excerpted from one
> > of the tests):
> >
> >     test_create_repo server
> >     test_commit -C server 1 1.t abcdefgh
> >     HASH=$(git hash-object server/1.t)
> >     
> >     test_create_repo client
> >     test_must_fail git -C client cat-file -p "$HASH"
> >     git -C client config core.repositoryformatversion 1
> >     git -C client config extensions.lazyobject \
> >         "\"$TEST_DIRECTORY/t0410/lazy-object\" \"$(pwd)/server/.git\""
> >     git -C client cat-file -p "$HASH"
> >
> > with fsck still working. Also, there is no need for a list of promised
> > blobs, and the long-running process protocol is being used.
> 
> I do not think I read your response to my last comment on this
> series, so I could be missing something large, but I am afraid that
> the resulting fsck is only half as useful as the normal fsck.  I do
> not see it any better than a hypothetical castrated version of fsck
> that only checks the integrity of objects that appear in the local
> object store, without doing any connectivity checks.

Sorry, I haven't replied to your last response [1]. That does sound like
a good idea, though, and probably can be extended to trees and blobs in
that we need to make sure that any object referenced from local-only
commits (calculated as you describe in [1]) can be obtained through an
object walk from a remote-tracking branch.

I haven't fully thought of the implications of things like building
commits on top of an arbitrary upstream commit (so since our upstream
commit is not a tip, the object walk from all remote-tracking branches
might not reach our upstream commit).

To try to solve that, we could use an alternate object store to store
remote objects in order to be able to find remote objects quickly
without doing a traversal, but that does not fully solve the problem,
because some information about remote object possession lies only in
their parents (for example, if we don't have a remote blob, sometimes
the only way to know that the remote has it is by having a tree
containing that blob).

In addition, this also couples the lazy object loading with either a
remote ref (or all remote refs, if we decide to consider objects from
all remote refs as potentially loadable).

I'll think about this further.

[1] https://public-inbox.org/git/xmqq379fkz4x.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/

> Don't get me wrong.  The integrity check on local objects you still
> do is important---that is what allows us to make sure that the local
> "cache" does not prevent us from going to the real source of the
> remote object store by having a corrupt copy.  
> 
> But not being able to tell if a missing object is OK to be missing
> (because we can get them if/as needed from elsewhere) or we lost the
> sole copy of an object that we created and have not pushed out
> (hence we are in deep yogurt) makes it pretty much pointless to run
> "fsck", doesn't it?  It does not give us any guarantee that our
> repository plus perfect network connectivity gives us an environment
> to build further work on.
> 
> Or am I missing something fundamental?

Well, the fsck can still detect issues like corrupt objects (as you
mention above) and dangling heads, which might be real issues. But it is
true that it does not give you the guarantee you describe.

>From a user standpoint, this might be able to be worked around by
providing a network-requiring object connectivity checking tool or by
just having the user running a build to ensure that all necessary files
are present.

Having said that, this feature will be very nice to have.