Re: Determining whether you have a commit locally, in a partial clone?

Tao Klerks <tao@xxxxxxxxxx> · Wed, 21 Jun 2023 12:10:33 +0200

On Wed, Jun 21, 2023 at 8:45 AM Jeff King <peff@xxxxxxxx> wrote:
>
> On Tue, Jun 20, 2023 at 09:12:24PM +0200, Tao Klerks wrote:
>
> > I'm back to begging for any hints here: Any idea how I can determine
> > whether a given commit object exists locally, *without causing it to
> > be fetched by the act of checking for it?*
>
> This is not very efficient, but:
>
>   git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
>   grep $some_sha1
>
> will tell you whether we have the object locally.
>

Thanks so much for your help!

On Windows (MSYS or Git Bash) this is still very slow in my repo with
6,500,000 local objects - around 60s - but on Linux, on the same repo,
it's quite a lot faster, at 5s. A large proportion of my users are on
Windows though, so I don't think this will be "good enough" for my
purposes, when I often need to check for the existence of dozens or
even hundreds of commits.
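
For what it's worth, when there are many commits to check, the cost of
that full scan could at least be paid once per batch rather than once
per commit. A minimal sketch, assuming the candidate IDs are full-length
and listed one per line in a file (the function name local_subset and
the file are hypothetical, not anything Git provides):

```shell
# Hypothetical helper: print the IDs from file "$1" (one full-length
# object ID per line) that exist in the local object store.
# It enumerates all local objects once, however many IDs are in the file.
local_subset() {
    git cat-file --batch-check='%(objectname)' --batch-all-objects --unordered |
    grep -F -x -f "$1"
}
```

Any ID in the file that does not show up in the output is not available
locally. This only amortizes the scan; each batch still pays for it once.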

> I don't work with partial clones often, but it feels like being able to
> say:
>
>   git --no-partial-fetch cat-file ...
>
> would be a useful primitive to have.

It feels that way to me, yes!

On the other hand, I find very little demand for it when I search "the
internet" - or I don't know how to search for it.

> The implementation might start
> something like this:
>
> diff --git a/object-file.c b/object-file.c
> index 7c1af5c8db..494cdd7706 100644
> --- a/object-file.c
> +++ b/object-file.c
> @@ -1555,6 +1555,14 @@ void disable_obj_read_lock(void)
>
>  int fetch_if_missing = 1;
>
> +static int allow_lazy_fetch(void)
> +{
> +       static int ret = -1;
> +       if (ret < 0)
> +               ret = git_env_bool("GIT_PARTIAL_FETCH", 1);
> +       return ret;
> +}
> +
>  static int do_oid_object_info_extended(struct repository *r,
>                                        const struct object_id *oid,
>                                        struct object_info *oi, unsigned flags)
> @@ -1622,6 +1630,7 @@ static int do_oid_object_info_extended(struct repository *r,
>
>                 /* Check if it is a missing object */
>                 if (fetch_if_missing && repo_has_promisor_remote(r) &&
> +                   allow_lazy_fetch() &&
>                     !already_retried &&
>                     !(flags & OBJECT_INFO_SKIP_FETCH_OBJECT)) {
>                         promisor_remote_get_direct(r, real, 1);
>
> and then have git.c populate the environment variable, similar to how we
> handle --literal-pathspecs, etc.
>
> That fetch_if_missing kind of does the same thing, but it's mostly
> controlled by programs themselves which try to handle missing remote
> objects specially.

Thanks, I will play with this if I get the chance. That said, I don't
control my users' distributions of Git, so on a purely practical basis
I'm looking for something that will work in git 2.39 to whatever
future version would introduce such a capability. (before 2.39, the
"set remote to False" hack works)

> It does seem like you might be able to bend it to
> your will here, though. I think without any patches that:
>
>   git rev-list --objects --exclude-promisor-objects $oid
>
> will tell you whether we have the object or not (since it turns off
> fetch_if_missing, and thus will either succeed, printing nothing, or
> bail if the object can't be found).

This behaves in a way that I don't understand:

In the repo that I'm working in, this command runs successfully
*without fetching*, but it takes a *very* long time - 300+ seconds -
much longer than even the "inefficient" 'cat-file'-based printing of
all (6.5M) local object ids that you proposed above. I haven't
attempted to understand what's going on in there (besides running with
GIT_TRACE2_PERF, which showed nothing interesting), but the idea that
git would have to work super-hard to find an object by its ID seems
counter to everything I know about it. Would there be value in my
trying to understand & reproduce this in a shareable repo, or is there
already an explanation as to why this command could/should ever do
non-trivial work, even in the largest partial repos?

> It feels like --missing=error should
> function similarly, but it seems to still lazy-fetch (I guess since it's
> the default, the point is to just find truly unavailable objects). Using
> --missing=print disables the lazy-fetch, but it seems to bail
> immediately if you ask it about a missing object (I didn't dig, but my
> guess is that --missing is mostly about objects we traverse, not the
> initial tips).

Woah, "--missing=print" seems to work!!!

The following gives me the commit hash if I have it locally, and an
error otherwise - consistently across Linux and Windows, Git versions
2.41, 2.39, 2.38, and 2.36 - without fetching, and without crazy
CPU-churning:

  git rev-list --missing=print -1 $oid

Thank you thank you thank you!
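
For scripting, I expect to wrap the command roughly like this. It is
only a sketch: the function name have_commit_locally is made up, and the
check for a "?"-prefixed line is a defensive assumption on my part, in
case some Git version reports a missing tip that way instead of exiting
with an error as the versions I tested do.

```shell
# Hypothetical helper: succeed only if commit "$1" is already in the
# local object store, going by "git rev-list --missing=print -1".
have_commit_locally() {
    # A non-zero exit from rev-list is treated as "not available locally".
    out=$(git rev-list --missing=print -1 "$1" 2>/dev/null) || return 1
    case "$out" in
        # Empty output (e.g. not a commit) or a "?"-prefixed line
        # (assumed missing-object marker) both count as absent.
        ''|\?*) return 1 ;;
        *) return 0 ;;
    esac
}
```

Checking a few hundred commits is then a loop over the IDs that calls
have_commit_locally "$oid" for each one, with one rev-list process per
commit and no scan of the whole object store.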

I feel like I should try to work something into the doc about this,
but I'm not sure how to express this: "--missing=error is the default,
but it doesn't actually error out when you're explicitly asking about
a missing commit, it fetches it instead - but --missing=print actually
*does* error out if you explicitly ask about a missing commit" seems
like a strange thing to be saying.

Thanks again for finding me an efficient working strategy here!