Re: [PATCH 0/8] Speed up connectivity checks via quarantine dir

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 21 May 2021 06:45:02 +0900

Jeff King <peff@xxxxxxxx> writes:

> If we have an unreachable tree in the object database which references
> blobs we don't have, that doesn't make the repository corrupt. And with
> the current code, we would not accept a push that references that tree
> (unless it also pushes the necessary blobs). But after your patch, we
> would, and that would _make_ the repository corrupt.
>
> I will say that:
>
>   1. Modern versions of git-repack and git-prune try to keep even
>      unreachable parts of the graph complete (if we are keeping object X
>      that refers to Y, then we try to keep Y, too). But I don't know how
>      foolproof it is (certainly the traversal we do there is "best
>      effort"; if there's a missing reference that exists, we don't
>      bail).
>
>   2. This is not the only place that just checks object existence in the
>      name of speed. When updating a ref, for example, we only check that
>      the tip object exists.

In other words, there might already be other ways to corrupt a
repository, and to leave that corruption unnoticed.

But that does not make it OK to add more ways to corrupt
repositories.

>   1. We could easily keep the original rule by just traversing the
>      object graph starting from the ref tips, as we do now, but ending
>      the traversal any time we hit an object that we already have
>      outside the quarantine.
>
>   2. This tightening is actually important if we want to avoid letting
>      people _intentionally_ introduce the unreachable-but-incomplete
>      scenario. Without it, an easy denial-of-service corruption against
>      a repository you can push to is:
>
>        - push an update to change a ref from X to Y. Include all objects
>          necessary for X..Y, but _also_ include a tree T which points to
>          a missing blob B. This will be accepted by the current rules
>          (but not by your patch).
>
>        - push an update to change the ref from Y to C, where C is a
>          commit whose root tree is T. Your patch allows this (because we
>          already have T in the repository). But the resulting repository
>          is corrupt (the ref now points to an incomplete object graph).

Hmph, the last step of that attack would not work with our current
check.  Is this the same new hole the series brings in, the one you
explained earlier, where a newly pushed tree/commit starts to
reference a left-over dangling tree that is already in the repository
but whose blobs are missing?
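
To make the difference concrete, here is a toy sketch in Python.  It
is not git's actual code: the dict-based object store, the object
names and the reachable()/connected() helpers are all made up for
illustration.  It models only the second push of the scenario above,
and compares "stop at objects reachable from existing refs" with
"stop at any object that exists outside the quarantine".

```python
def reachable(store, tips):
    """Return every object id reachable from tips that exists in store."""
    seen = set()
    stack = list(tips)
    while stack:
        oid = stack.pop()
        if oid in seen or oid not in store:
            continue
        seen.add(oid)
        stack.extend(store[oid])
    return seen


def connected(new_tip, main, quarantine, stop_at):
    """Walk from new_tip; objects in stop_at are trusted and not walked into.

    Returns False as soon as a referenced object is found in neither
    the quarantine nor the main store.
    """
    seen = set()
    stack = [new_tip]
    while stack:
        oid = stack.pop()
        if oid in seen or oid in stop_at:
            continue
        seen.add(oid)
        if oid in quarantine:
            stack.extend(quarantine[oid])
        elif oid in main:
            stack.extend(main[oid])
        else:
            return False  # a referenced object is missing
    return True


# State after the first push: ref points at Y, and the dangling tree T
# (which refers to the missing blob B) sits in the main object store.
main = {"X": [], "Y": ["X"], "T": ["B"]}
refs = ["Y"]

# Second push: commit C with parent Y and root tree T.
quarantine = {"C": ["Y", "T"]}

# Stop only at objects reachable from existing refs.
strict = connected("C", main, quarantine, stop_at=reachable(main, refs))

# Stop at anything that merely exists outside the quarantine.
loose = connected("C", main, quarantine, stop_at=set(main))

print(strict, loose)  # False True
```

The strict walk descends into T, finds that B is missing and rejects
the push; the loose walk stops at T because it exists, and accepts a
ref that points at an incomplete graph.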

> If we wanted to keep the existing rule (requiring that any objects that
> sender didn't provide are actually reachable from the current refs),
> then we'd want to be able to check reachability quickly. And there I'd
> probably turn to reachability bitmaps.

True.  We are not a "performance is king, so code that corrupts
repositories as quickly as possible is an improvement" kind of
project.  We should keep the existing rule: an object can become part
of the DAG referred to by ref tips only when all the objects it refers
to exist in the object store.  That rule preserves the invariant that
an object reachable from a ref is guaranteed to have everything
reachable from it in the object store.  What we need is a way to
enforce that rule quickly.
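
Stated as a check, the invariant is roughly the following.  This is
again a toy Python sketch over a made-up dict-based object store, not
how git implements it; missing_from_refs() and the object names are
hypothetical.

```python
def missing_from_refs(store, ref_tips):
    """Return ids reachable from ref_tips that are absent from store.

    An empty result means the invariant holds: every object reachable
    from a ref has everything reachable from it in the object store.
    """
    missing = set()
    seen = set()
    stack = list(ref_tips)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid not in store:
            missing.add(oid)
            continue
        stack.extend(store[oid])
    return missing


# A dangling, incomplete tree is harmless while no ref reaches it.
store = {
    "commit1": ["tree1"],
    "tree1": ["blob1"],
    "blob1": [],
    "dangling_tree": ["lost_blob"],
}
before = missing_from_refs(store, ["commit1"])

# Once a ref points at a commit that uses the tree, the invariant breaks.
store["commit2"] = ["commit1", "dangling_tree"]
after = missing_from_refs(store, ["commit2"])

print(before, after)  # set() {'lost_blob'}
```

Walking the whole graph like this on every push is the slow part, and
that is why a quicker way to answer "is this object reachable from the
current refs" matters.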

Thanks for the review.