Re: [PATCH v4] revision: add `--ignore-missing-links` user option

Karthik Nayak <karthik.188@xxxxxxxxx> · Thu, 21 Sep 2023 12:53:11 +0200

On Wed, Sep 20, 2023 at 5:32 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> There needs an explanation of interaction with --missing=<action>
> option here, no?  "--missing=allow-any" and "--missing=print" are
> sensible choices, I presume.  The former allows the traversal to
> proceed, as you described in one of your responses.  Also with
> "--missing=print", the user can more directly find out which are the
> missing objects, even without using the "--boundary" that requires
> them to sift between missing objects and the objects that are truly
> on boundary.
>
> Here is my attempt:
>
>         --ignore-missing-links::
>                 During traversal, if an object that is referenced does not
>                 exist, instead of dying of a repository corruption, allow
>                 `--missing=<missing-action>` to decide what to do.
>         +
>         `--missing=print` will make the command print a list of missing
>         objects, prefixed with a "?" character.
>         +
>         `--missing=allow-any` will make the command proceed without doing
>         anything special.  Used with `--boundary`, output these missing
>         objects mixed with the commits on the edge of revision ranges,
>         prefixed with a "-" character.
>
> It might make sense to add
>
>         +
>         Use of this option with other 'missing-action' may probably not
>         give useful behaviour.
>
> at the end, but it may not be useful to the readers to say "we allow
> even more extra flexibility but haven't thought through what good
> they would do".
>

I was thinking about this, but mostly didn't do this, because the
interaction with `--missing`
is only for non-commit objects. Because for missing commits,
`--ignore-missing-links` skips
the commit and the value of `--missing` doesn't make any difference.

It's only for non-commit objects that `--missing` comes into play. So
perhaps change the current
explanation to:

--ignore-missing-links::
       During traversal, if a commit that is referenced does not
       exist, instead of dying of a repository corruption, pretend as
       if the commit itself does not exist. Running the command
       with the `--boundary` option makes these missing commits,
       together with the commits on the edge of revision ranges
       (i.e. true boundary objects), appear on the output, prefixed
       with '-'.

This way `--ignore-missing-links` is specific to commits, combining
this with `--missing=...` for
non-commit objects is left to the user. What do you think?

> > +# With `--ignore-missing-links`, we stop the traversal when we encounter a
> > +# missing link. The boundary commit is not listed as we haven't used the
> > +# `--boundary` options.
> > +test_expect_success 'rev-list only prints main odb commits with --ignore-missing-links' '
> > +     hide_alternates &&
> > +
> > +     git -C alt rev-list --objects --no-object-names \
> > +             --ignore-missing-links --missing=allow-any HEAD >actual.raw &&
> > +     git -C alt cat-file  --batch-check="%(objectname)" \
> > +             --batch-all-objects >expect.raw &&
> > +
> > +     sort actual.raw >actual &&
> > +     sort expect.raw >expect &&
> > +     test_cmp expect actual
> > +'
>
> This gives a good baseline.  "--missing=print" without "--boundary"
> may have more obvious use cases, but is there a practical use case
> for the output from an invocation with "--missing=allow-any" without
> "--boundary"?  Just being curious if I am missing something obvious.
>

Not really, but it's easier to build up the testing, here without
boundary we can
use cat-file to test all objects (commits and others) that are output
by rev-list.

Then we can build on top of this in the next test, where we can also ensure that
boundary commits are printed. This however is very simplistic, as
you've mentioned.
There could be other objects and we don't really check.

> Perhaps add another test that uses "--missing=print" instead, and
> check that the "? missing" output matches what we expect to be
> missing?  The same comment applies to the other test that uses
> "--missing=allow-any" without "--boundary" we see later.
>

Sure, we can add that too!

> > +# With `--ignore-missing-links` and `--boundary`, we can even print those boundary
> > +# commits.
> > +test_expect_success 'rev-list prints boundary commit with --ignore-missing-links' '
> > +     git -C alt rev-list --ignore-missing-links --boundary HEAD >got &&
> > +     grep "^-$(git rev-parse HEAD)" got
> > +'
>
> This makes sure what we expect to appear in 'got' actually is in
> 'got', but we should also make sure 'got' does not have anything
> unexpected.
>

Yeah, I can add that in too.

> > +test_expect_success "setup for rev-list --ignore-missing-links with missing objects" '
> > +     show_alternates &&
> > +     test_commit -C alt 11
> > +'
> > +
> > +for obj in "HEAD^{tree}" "HEAD:11.t"
> > +do
> > +     # The `--ignore-missing-links` option should ensure that git-rev-list(1)
> > +     # doesn't fail when used alongside `--objects` when a tree/blob is
> > +     # missing.
> > +     test_expect_success "rev-list --ignore-missing-links with missing $type" '
> > +             oid="$(git -C alt rev-parse $obj)" &&
> > +             path="alt/.git/objects/$(test_oid_to_path $oid)" &&
> > +
> > +             mv "$path" "$path.hidden" &&
> > +             test_when_finished "mv $path.hidden $path" &&
>
> In the first iteration, we check without the tree object and we only
> ensure that removed tree does not appear in the output---but we know
> the blob that is referenced by that removed tree will not appear in
> the output, either, don't we?  Don't we want to check that, too?
>
> In the second iteration, we have resurrected the tree but removed
> the blob that is referenced by the tree, so we would not see that
> blob in the output, which makes sense.
>

I was implementing this change and just realized that for missing
trees, show_object() is
never called (that is --missing=print has no effect).

That means we only call show_object() when there is a missing blob.

So this effectively means:
1. missing commits: --ignore-missing-links works, --missing=... has no effect
2. missing trees:      --ignore-missing-links works, --missing=... has no effect
3. missing blobs:      --ignore-missing-links works in conjunction
with --missing=...

I now think it does make even more sense to hardcode the skipping of
`finish_object__ma` this way
we can state that `--ignore-missing-links` and `--missing` are
incompatible, wherein `--ignore-missing-links`
ignores any missing object (irrelevant of type) and `--missing` is
used to specifically handle missing blobs
and provides options.

This is also how currently `--boundary` and `--missing=print` is
specific to commits and blobs respectively.
What do you think?