Re: [PATCH v3] promisor-remote: fix segfault when remote URL is missing

Christian Couder <christian.couder@xxxxxxxxx> · Thu, 13 Mar 2025 11:39:58 +0100

On Wed, Mar 12, 2025 at 6:02 PM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Christian Couder <christian.couder@xxxxxxxxx> writes:
>
> >       for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> > +             const char *url;
> >               char *url_key = xstrfmt("remote.%s.url", r->name);
> >
> >               strvec_push(names, r->name);
> >
> > +             /*
> > +              * No URL defaults to the name of the remote, like
> > +              * elsewhere in Git (e.g. `git fetch` or `git remote
> > +              * get-url`). It's still possible that an empty URL is
> > +              * configured.
> > +              */
>
> Not a huge deal as it is not telling any lies, but does the second
> sentence need to be said?  An element in the urls strvec being an
> empty string is not all that more interesting than it being an
> incorrect or malformed URL to those who are reading this piece of
> code, is it?  It is also possible that an unreachable URL or
> misspelt URL is configured, but it is not a job of this piece of
> code to worry about them, just like it is none of the business of
> this code if the configured URL is an empty string, no?

Yeah, right, I have removed the second sentence in the next version.

> > +             strvec_push(urls, git_config_get_string_tmp(url_key, &url) ? r->name : url);
>
> More on this below.  Unlike "git fetch" and "git push" used as the
> source and destination, the remote URL used in this context are
> exposed to the outside world, and I am not sure the usual r->name
> fallback makes sense.
>
> >               free(url_key);
> >       }
> >  }
> > @@ -356,7 +362,7 @@ char *promisor_remote_info(struct repository *repo)
> >                       strbuf_addch(&sb, ';');
> >               strbuf_addstr(&sb, "name=");
> >               strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
> > -             if (urls.v[i]) {
> > +             if (*urls.v[i]) {
> >                       strbuf_addstr(&sb, ",url=");
> >                       strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
>
> We used to advertise an empty string name to the other end, but we
> no longer do, which is a good hygiene to be strict on what we send
> out.
>
> But now our updated promisor_info_vecs() pushes our local name
> r->name as a fallback. The idea of r->name fallback is to use it as
> a local directory path for "git fetch" and friends, but the local
> pathname has no meaning to the other side, does it?  Is it something
> we want to let the other side even know???

It could happen that the server, the client and the common promisor
remote are all on the same filesystem. Then it would make sense for
both the server and the client to rely on just the remote name,
without any URL configured, to access the promisor remote. So if we
want things to work in this case, then I think the server should
advertise the remote name in the "url=" field.

Also it's not like the server is giving away secret information as it
already passes the remote name anyway in the "name=" field.

And yeah, the client could be configured with "KnownName" instead of
"KnownURL" in this case, but that wouldn't work if there are other
promisor remotes that the client and the server want to share and that
are not local and therefore need a URL configured on both sides.

> What other uses do the name/url vectors prepared by
> promisor_info_vecs() have?  Is it that we use them only to advertise
> with this code, and then match with what they advertise?

Yes, I think so.

> If we are
> not using these names and urls locally to fetch from in code paths,
> I am inclined to suggest that promisor_info_vecs() should not shove
> these fallback URLs (local directory name implicitly inferred) into
> the names/urls vectors.

We could do that but I think it would make it more difficult to make
things work in the case I discussed above (where the client and the
common promisor remote are all on the same filesystem, and both the
server and the client rely on just the remote name to access the
promisor remote).

> On the other hand, if other callsites that use the names/urls
> obtained from that function do want to see such local pathnames, we
> cannot lose information at the source, so we'd somehow need to
> filter them at various places, I guess.  And this place that builds
> up the string to be sent as capability response should be one of
> these places that must filter.

Other call sites don't use promisor_info_vecs(). It was introduced by
the lop patch series which doesn't change how other code gets the
remote names and URLs.

> > @@ -409,12 +415,42 @@ static int should_accept_remote(enum accept_promisor accept,
> >       if (accept != ACCEPT_KNOWN_URL)
> >               BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> >
> > +     if (!remote_url) {
> > +             warning(_("no URL advertised for remote '%s'"), remote_name);
> > +             return 0;
> > +     }
>
> Except for the above "no URL advertised" warning and returning,
> which is absolutely a good thing to do, I am still not sure how
> relevant various checks for an empty string new code added by this
> patch makes are ...
>
> > +     if (!*remote_url) {
> > +             /*
> > +              * This shouldn't happen with a Git server, but not
> > +              * sure how other servers will be implemented in the
> > +              * future.
> > +              */
> > +             warning(_("empty URL advertised for remote '%s'"), remote_name);
> > +             return 0;
> > +     }
> > +
> > +     if (!*urls->v[i]) {
> > +             warning(_("empty URL configured for remote '%s'"), remote_name);
> > +             return 0;
> > +     }
> > +
>
> ... would it be so different to pass an empty string as to pass a
> misspelt URL received from the other end?  Wouldn't the end result
> the same (i.e., we thought we had a URL usable as a promisor remote,
> but it turns out that we cannot reach it)?

Perhaps but I think it would be weird if URLs are matching when they
are empty on both sides. I think it makes more sense and is more
helpful to warn with a clear error message and just reject the remote
if any of the URL is empty.

> >       if (!strcmp(urls->v[i], remote_url))
> >               return 1;
>
> Past this point, I am not sure what the points of these checks and
> warnings are; even with these "problematic" remote_name and remote_url
> combinations these warnings attempt to warn against are used, as long
> as the above check said it is OK, we'd silently said "should accept"
> already to the caller.

Past this point we are in the case where remote names matched but
remote URLs didn't match. So I think we should help diagnose things
(with warnings) because it's likely that the intent was for the URL to
also match but a mistake prevented that from happening.

> > -     warning(_("known remote named '%s' but with url '%s' instead of '%s'"),
> > +     warning(_("known remote named '%s' but with URL '%s' instead of '%s'"),
> >               remote_name, urls->v[i], remote_url);
> >
> > +     if (!strcmp(remote_name, urls->v[i]))
>
> The 'i' was obtained by calling remote_nick_find(), which uses
> strcasecmp() to find named remote (which I doubt it is a sensible
> design by the way).  This code should be consistent with whatever
> comparison used there.

I think comparing remote names case insensitively is fair. It's likely
to just make things a bit easier for users.

In the next version, I have changed the comparison to strcasecmp()
here as I agree it could help if the comparisons are consistent.

> > +             warning(_("remote name and URL are the same '%s', "
> > +                       "maybe the URL is not configured locally"),
> > +                     remote_name);
> > +
> > +     if (!strcmp(remote_name, remote_url))
>
> This is matching what r->name fallback did so it is correct to be
> strcmp().  But (1) it may be way too late after the above "return
> 1", and (2) if we are *not* going to use it, perhaps we shouldn't
> place it in the resulting strvec from promisor_info_vecs() in the
> first place?

We are in the case the URLs didn't match. So yeah we are not going to
use the remote info because we are going to reject the remote (with
`return 0;`) a few lines below. But it would be nice if we can help
users diagnose what happened.

If we notice that remote_name and remote_url are the same it might be
because the URL is not configured on the server side, so the server
passed the remote name instead. It can be nice to tell users that this
might have happened to help them debug.

> > +             warning(_("remote name and URL are the same '%s', "
> > +                       "maybe the URL is not configured on the remote side"),
> > +                     remote_name);
> > +
> >       return 0;
> >  }