Re: [PATCH 2/2] fsck: lazily load types under --connectivity-only

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 26 Jan 2017 10:51:00 -0800

Jeff King <peff@xxxxxxxx> writes:

> The recent fixes to "fsck --connectivity-only" load all of
> the objects with their correct types. This keeps the
> connectivity-only code path close to the regular one, but it
> also introduces some unnecessary inefficiency. While getting
> the type of an object is cheap compared to actually opening
> and parsing the object (as the non-connectivity-only case
> would do), it's still not free.
>
> For reachable non-blob objects, we end up having to parse
> them later anyway (to see what they point to), making our
> type lookup here redundant.
>
> For unreachable objects, we might never hit them at all in
> the reachability traversal, making the lookup completely
> wasted. And in some cases, we might have quite a few
> unreachable objects (e.g., when alternates are used for
> shared object storage between repositories, it's normal for
> there to be objects reachable from other repositories but
> not the one running fsck).
>
> The comment in mark_object_for_connectivity() claims two
> benefits to getting the type up front:
>
>   1. We need to know the types during fsck_walk(). (And not
>      explicitly mentioned, but we also need them when
>      printing the types of broken or dangling commits).
>
>      We can address this by lazy-loading the types as
>      necessary. Most objects never need this lazy-load at
>      all, because they fall into one of these categories:
>
>        a. Reachable from our tips, and are coerced into the
> 	  correct type as we traverse (e.g., a parent link
> 	  will call lookup_commit(), which converts OBJ_NONE
> 	  to OBJ_COMMIT).
>
>        b. Unreachable, but not at the tip of a chunk of
>           unreachable history. We only mention the tips as
> 	  "dangling", so an unreachable commit which links
> 	  to hundreds of other objects needs only report the
> 	  type of the tip commit.
>
>   2. It serves as a cross-check that the coercion in (1a) is
>      correct (i.e., we'll complain about a parent link that
>      points to a blob). But we get most of this for free
>      already, because right after coercing, we'll parse any
>      non-blob objects. So we'd notice then if we expected a
>      commit and got a blob.
>
>      The one exception is when we expect a blob, in which
>      case we never actually read the object contents.
>
>      So this is a slight weakening, but given that the whole
>      point of --connectivity-only is to sacrifice some data
>      integrity checks for speed, this seems like an
>      acceptable tradeoff.

The only weakening is that a non-blob (or a corrupt blob) object
that sits where we expect a blob (because the containing tree or the
tag says so) would not be diagnosed as an error, then?  I think that
is in line with the spirit of --connectivity-only and is a good
trade-off.