Re: [PATCH v4 4/6] Add 'promisor-remote' capability to protocol v2

Patrick Steinhardt <ps@xxxxxx> · Thu, 30 Jan 2025 11:51:07 +0100

On Mon, Jan 27, 2025 at 04:16:59PM +0100, Christian Couder wrote:
> When a server S knows that some objects from a repository are available
> from a promisor remote X, S might want to suggest to a client C cloning
> or fetching the repo from S that C may use X directly instead of S for
> these objects.

A lot of the commit message seems to be duplicated with the technical
documentation that you add. I wonder whether it would make sense to
simply refer to that instead of repeating all of it? That would make it
easier to spot the actually-important bits in the commit message that
add context to the patch.

One very important bit of context that I was lacking is what exactly we
wire up and where we do so. I have been searching for longer than I want
to admit where the client ends up using the promisor remotes, until I
eventually figured out that the client-side isn't wired up at all. It
makes sense in retrospect, but it would've been nice if the reader was
guided a bit.

> diff --git a/Documentation/gitprotocol-v2.txt b/Documentation/gitprotocol-v2.txt
> index 1652fef3ae..f25a9a6ad8 100644
> --- a/Documentation/gitprotocol-v2.txt
> +++ b/Documentation/gitprotocol-v2.txt
> @@ -781,6 +781,60 @@ retrieving the header from a bundle at the indicated URI, and thus
>  save themselves and the server(s) the request(s) needed to inspect the
>  headers of that bundle or bundles.
>  
> +promisor-remote=<pr-infos>
> +~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +The server may advertise some promisor remotes it is using or knows
> +about to a client which may want to use them as its promisor remotes,
> +instead of this repository. In this case <pr-infos> should be of the
> +form:
> +
> +	pr-infos = pr-info | pr-infos ";" pr-info
> +
> +	pr-info = "name=" pr-name | "name=" pr-name "," "url=" pr-url
> +
> +where `pr-name` is the urlencoded name of a promisor remote, and
> +`pr-url` the urlencoded URL of that promisor remote.
> +
> +In this case, if the client decides to use one or more promisor
> +remotes the server advertised, it can reply with
> +"promisor-remote=<pr-names>" where <pr-names> should be of the form:
> +
> +	pr-names = pr-name | pr-names ";" pr-name
> +
> +where `pr-name` is the urlencoded name of a promisor remote the server
> +advertised and the client accepts.
> +
> +Note that, everywhere in this document, `pr-name` MUST be a valid
> +remote name, and the ';' and ',' characters MUST be encoded if they
> +appear in `pr-name` or `pr-url`.
> +
> +If the server doesn't know any promisor remote that could be good for
> +a client to use, or prefers a client not to use any promisor remote it
> +uses or knows about, it shouldn't advertise the "promisor-remote"
> +capability at all.
> +
> +In this case, or if the client doesn't want to use any promisor remote
> +the server advertised, the client shouldn't advertise the
> +"promisor-remote" capability at all in its reply.
> +
> +The "promisor.advertise" and "promisor.acceptFromServer" configuration
> +options can be used on the server and client side respectively to

s/respectively//, as you already say that in the next line.

> +control what they advertise or accept respectively. See the
> +documentation of these configuration options for more information.
> +
> +Note that in the future it would be nice if the "promisor-remote"
> +protocol capability could be used by the server, when responding to
> +`git fetch` or `git clone`, to advertise better-connected remotes that
> +the client can use as promisor remotes, instead of this repository, so
> +that the client can lazily fetch objects from these other
> +better-connected remotes. This would require the server to omit in its
> +response the objects available on the better-connected remotes that
> +the client has accepted. This hasn't been implemented yet though. So
> +for now this "promisor-remote" capability is useful only when the
> +server advertises some promisor remotes it already uses to borrow
> +objects from.

I'd leave away this bit as it doesn't really add a lot to the document.
It's a possibility for the future, but without it being implemented
anywhere it's not that helpful from my point of view.

> diff --git a/promisor-remote.c b/promisor-remote.c
> index c714f4f007..5ac282ed27 100644
> --- a/promisor-remote.c
> +++ b/promisor-remote.c
> @@ -292,3 +306,185 @@ void promisor_remote_get_direct(struct repository *repo,
>  	if (to_free)
>  		free(remaining_oids);
>  }
> +
> +static int allow_unsanitized(char ch)
> +{
> +	if (ch == ',' || ch == ';' || ch == '%')
> +		return 0;
> +	return ch > 32 && ch < 127;
> +}

Isn't this too lenient? It would allow also allow e.g. '=' and all kinds
of other characters. This does make sense for URLs, but it doesn't make
sense for remote names as they aren't supposed to contain punctuation in
the first place. So for these remote names I'd think we should be way
stricter and return an error in case they contain non-alphanumeric data.

> +static void promisor_info_vecs(struct repository *repo,
> +			       struct strvec *names,
> +			       struct strvec *urls)

I wonder whether it would make more sense to track these as a strmap
instead of two arrays which are expected to have related entries in the
same place.

> +{
> +	struct promisor_remote *r;
> +
> +	promisor_remote_init(repo);
> +
> +	for (r = repo->promisor_remote_config->promisors; r; r = r->next) {
> +		char *url;
> +		char *url_key = xstrfmt("remote.%s.url", r->name);
> +
> +		strvec_push(names, r->name);
> +		strvec_push(urls, git_config_get_string(url_key, &url) ? NULL : url);
> +
> +		free(url);
> +		free(url_key);
> +	}
> +}
> +
> +char *promisor_remote_info(struct repository *repo)
> +{
> +	struct strbuf sb = STRBUF_INIT;
> +	int advertise_promisors = 0;
> +	struct strvec names = STRVEC_INIT;
> +	struct strvec urls = STRVEC_INIT;
> +
> +	git_config_get_bool("promisor.advertise", &advertise_promisors);
> +
> +	if (!advertise_promisors)
> +		return NULL;
> +
> +	promisor_info_vecs(repo, &names, &urls);
> +
> +	if (!names.nr)
> +		return NULL;
> +
> +	for (size_t i = 0; i < names.nr; i++) {
> +		if (i)
> +			strbuf_addch(&sb, ';');
> +		strbuf_addstr(&sb, "name=");
> +		strbuf_addstr_urlencode(&sb, names.v[i], allow_unsanitized);
> +		if (urls.v[i]) {
> +			strbuf_addstr(&sb, ",url=");
> +			strbuf_addstr_urlencode(&sb, urls.v[i], allow_unsanitized);
> +		}
> +	}
> +
> +	redact_non_printables(&sb);

So here we replace non-printable characters with dots as far as I
understand. But didn't we just URL-encode the strings? So is there ever
a possibility for non-printable characters here?

> +	strvec_clear(&names);
> +	strvec_clear(&urls);
> +
> +	return strbuf_detach(&sb, NULL);
> +}
> +
> +enum accept_promisor {
> +	ACCEPT_NONE = 0,
> +	ACCEPT_ALL
> +};
> +
> +static int should_accept_remote(enum accept_promisor accept,
> +				const char *remote_name UNUSED,
> +				const char *remote_url UNUSED)
> +{
> +	if (accept == ACCEPT_ALL)
> +		return 1;
> +
> +	BUG("Unhandled 'enum accept_promisor' value '%d'", accept);
> +}
> +
> +static void filter_promisor_remote(struct strvec *accepted, const char *info)
> +{
> +	struct strbuf **remotes;
> +	const char *accept_str;
> +	enum accept_promisor accept = ACCEPT_NONE;
> +
> +	if (!git_config_get_string_tmp("promisor.acceptfromserver", &accept_str)) {
> +		if (!accept_str || !*accept_str || !strcasecmp("None", accept_str))
> +			accept = ACCEPT_NONE;
> +		else if (!strcasecmp("All", accept_str))
> +			accept = ACCEPT_ALL;
> +		else
> +			warning(_("unknown '%s' value for '%s' config option"),
> +				accept_str, "promisor.acceptfromserver");
> +	}
> +
> +	if (accept == ACCEPT_NONE)
> +		return;
> +
> +	/* Parse remote info received */
> +
> +	remotes = strbuf_split_str(info, ';', 0);
> +
> +	for (size_t i = 0; remotes[i]; i++) {
> +		struct strbuf **elems;
> +		const char *remote_name = NULL;
> +		const char *remote_url = NULL;
> +		char *decoded_name = NULL;
> +		char *decoded_url = NULL;
> +
> +		strbuf_strip_suffix(remotes[i], ";");
> +		elems = strbuf_split(remotes[i], ',');
> +
> +		for (size_t j = 0; elems[j]; j++) {
> +			int res;
> +			strbuf_strip_suffix(elems[j], ",");
> +			res = skip_prefix(elems[j]->buf, "name=", &remote_name) ||
> +				skip_prefix(elems[j]->buf, "url=", &remote_url);
> +			if (!res)
> +				warning(_("unknown element '%s' from remote info"),
> +					elems[j]->buf);
> +		}
> +
> +		if (remote_name)
> +			decoded_name = url_percent_decode(remote_name);
> +		if (remote_url)
> +			decoded_url = url_percent_decode(remote_url);

This is data we have received from a potentially-untrusted remote, so we
should double-check that the data we have received doesn't contain any
weird characters:

  - For the remote name we should verify that it consists only of
    alphanumeric characters.

  - For the remote URL we need to verify that it's a proper URL without
    any newlines, non-printable characters or anything else.

We'll eventually end up storing that data in the configuration, so these
verifications are quite important so that an adversarial server cannot
perform config-injection and thus cause remote code execution.

[snip]
> +void mark_promisor_remotes_as_accepted(struct repository *r, const char *remotes)
> +{
> +	struct strbuf **accepted_remotes = strbuf_split_str(remotes, ';', 0);
> +
> +	for (size_t i = 0; accepted_remotes[i]; i++) {
> +		struct promisor_remote *p;
> +		char *decoded_remote;
> +
> +		strbuf_strip_suffix(accepted_remotes[i], ";");
> +		decoded_remote = url_percent_decode(accepted_remotes[i]->buf);
> +
> +		p = repo_promisor_remote_find(r, decoded_remote);
> +		if (p)
> +			p->accepted = 1;
> +		else
> +			warning(_("accepted promisor remote '%s' not found"),
> +				decoded_remote);

My initial understanding of this code was that it is about the
client-side accepting a remote, but this is about the server-side and
tracks whether a promisor remote has been accepted by the client. It
feels a bit weird to modify semi-global state for this, as I'd have
rather expected that we pass around a vector of accepted remotes
instead.

But I guess ultimately this isn't too bad. It would be nice though if
it was more obvious whether we're on the server- or client-side.

> diff --git a/t/t5710-promisor-remote-capability.sh b/t/t5710-promisor-remote-capability.sh
> new file mode 100755
> index 0000000000..0390c1dbad
> --- /dev/null
> +++ b/t/t5710-promisor-remote-capability.sh
> @@ -0,0 +1,244 @@
[snip]
> +initialize_server () {
> +	count="$1"
> +	missing_oids="$2"
> +
> +	# Repack everything first
> +	git -C server -c repack.writebitmaps=false repack -a -d &&
> +
> +	# Remove promisor file in case they exist, useful when reinitializing
> +	rm -rf server/objects/pack/*.promisor &&
> +
> +	# Repack without the largest object and create a promisor pack on server
> +	git -C server -c repack.writebitmaps=false repack -a -d \
> +	    --filter=blob:limit=5k --filter-to="$(pwd)/pack" &&
> +	promisor_file=$(ls server/objects/pack/*.pack | sed "s/\.pack/.promisor/") &&
> +	>"$promisor_file" &&
> +
> +	# Check objects missing on the server
> +	check_missing_objects server "$count" "$missing_oids"
> +}
> +
> +copy_to_server2 () {

Nit: `server2` could be renamed to `promisor` to make the relation
between the two servers more obvious.

> diff --git a/upload-pack.c b/upload-pack.c
> index 728b2477fc..7498b45e2e 100644
> --- a/upload-pack.c
> +++ b/upload-pack.c
> @@ -319,6 +320,8 @@ static void create_pack_file(struct upload_pack_data *pack_data,
>  		strvec_push(&pack_objects.args, "--delta-base-offset");
>  	if (pack_data->use_include_tag)
>  		strvec_push(&pack_objects.args, "--include-tag");
> +	if (repo_has_accepted_promisor_remote(the_repository))
> +		strvec_push(&pack_objects.args, "--missing=allow-promisor");

This is nice and simple, I like it.

Patrick