Jeff King <peff@xxxxxxxx> writes: > The recent fixes to "fsck --connectivity-only" load all of > the objects with their correct types. This keeps the > connectivity-only code path close to the regular one, but it > also introduces some unnecessary inefficiency. While getting > the type of an object is cheap compared to actually opening > and parsing the object (as the non-connectivity-only case > would do), it's still not free. > > For reachable non-blob objects, we end up having to parse > them later anyway (to see what they point to), making our > type lookup here redundant. > > For unreachable objects, we might never hit them at all in > the reachability traversal, making the lookup completely > wasted. And in some cases, we might have quite a few > unreachable objects (e.g., when alternates are used for > shared object storage between repositories, it's normal for > there to be objects reachable from other repositories but > not the one running fsck). > > The comment in mark_object_for_connectivity() claims two > benefits to getting the type up front: > > 1. We need to know the types during fsck_walk(). (And not > explicitly mentioned, but we also need them when > printing the types of broken or dangling commits). > > We can address this by lazy-loading the types as > necessary. Most objects never need this lazy-load at > all, because they fall into one of these categories: > > a. Reachable from our tips, and are coerced into the > correct type as we traverse (e.g., a parent link > will call lookup_commit(), which converts OBJ_NONE > to OBJ_COMMIT). > > b. Unreachable, but not at the tip of a chunk of > unreachable history. We only mention the tips as > "dangling", so an unreachable commit which links > to hundreds of other objects needs only report the > type of the tip commit. > > 2. It serves as a cross-check that the coercion in (1a) is > correct (i.e., we'll complain about a parent link that > points to a blob). But we get most of this for free > already, because right after coercing, we'll parse any > non-blob objects. So we'd notice then if we expected a > commit and got a blob. > > The one exception is when we expect a blob, in which > case we never actually read the object contents. > > So this is a slight weakening, but given that the whole > point of --connectivity-only is to sacrifice some data > integrity checks for speed, this seems like an > acceptable tradeoff. The only weakening is that a non-blob (or a corrupt blob) object that sits where we expect a blob (because the containing tree or the tag says so) would not be diagnosed as an error, then? I think that is in line with the spirit of --connectivity-only and is a good trade-off.