Re: [PATCH v3 4/4] urlmatch: allow globbing for the URL host part

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 26 Jan 2017 12:43:31 -0800

Patrick Steinhardt <patrick.steinhardt@xxxxxxxx> writes:

> The URL matching function computes for two URLs whether they match not.
> The match is performed by splitting up the URL into different parts and
> then doing an exact comparison with the to-be-matched URL.
>
> The main user of `urlmatch` is the configuration subsystem. It allows to
> set certain configurations based on the URL which is being connected to
> via keys like `http.<url>.*`. A common use case for this is to set
> proxies for only some remotes which match the given URL. Unfortunately,
> having exact matches for all parts of the URL can become quite tedious
> in some setups. Imagine for example a corporate network where there are
> dozens or even hundreds of subdomains, which would have to be configured
> individually.
>
> This commit introduces the ability to use globbing in the host-part of
> the URLs. A user can simply specify a `*` as part of the host name to
> match all subdomains at this level. For example adding a configuration
> key `http.https://*.example.com.proxy` will match all subdomains of
> `https://example.com`.

This is probably a useful improvement.

Having said that, when I mentioned "glob", I meant to also support
something like this:

	https://www[1-4].ibm.com/

And when people read "glob", that is what they expect.

So calling this "the ability to use globbing" is misleading.
The last paragraph in the log message above needs a bit of
tweaking, perhaps like this:

	Allow users to write an asterisk '*' in place of any 'host'
	or 'subdomain' label as part of the host name.  For example,
	"http.https://*.example.com.proxy"; sets "http.proxy" for all
	direct subdomains of "https://example.com";,
	e.g. "https://foo.example.com";, but not
	"https://foo.bar.example.com";.

Fortunately, your update to config.txt, which is facing the end
users, does not misuse the word and instead is explicit that the
only thing the matcher does is to match '*' to a single hierarchy.
It is clear that even http://www*.ibm.com/ is not supported from
the description, which is good.

>  . Host/domain name (e.g., `example.com` in `https://example.com/`).
> -  This field must match exactly between the config key and the URL.
> +  This field must match between the config key and the URL. It is
> +  possible to specify a `*` as part of the host name to match all subdomains
> +  at this level. `https://*.example.com/` for example would match
> +  `https://foo.example.com/`, but not `https://foo.bar.example.com/`.

This is good as-is.

>  . Port number (e.g., `8080` in `http://example.com:8080/`).
>    This field must match exactly between the config key and the URL.
> diff --git a/t/t1300-repo-config.sh b/t/t1300-repo-config.sh
> index 923bfc5a2..ec545e092 100755
> --- a/t/t1300-repo-config.sh
> +++ b/t/t1300-repo-config.sh
> @@ -1177,6 +1177,42 @@ test_expect_success 'urlmatch' '
>  	test_cmp expect actual
>  '
>  
> +test_expect_success 'glob-based urlmatch' '

This is not "glob".  A more generic term "wildcard" is OK.

> +	cat >.git/config <<-\EOF &&
> +	[http]
> +		sslVerify
> ...
> +static int match_host(const struct url_info *url_info,
> +		      const struct url_info *pattern_info)
> +{
> +	char *url = xmemdupz(url_info->url + url_info->host_off, url_info->host_len);
> +	char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, pattern_info->host_len);
> +	char *url_tok, *pat_tok, *url_save, *pat_save;
> +	int matching;
> +
> +	url_tok = strtok_r(url, ".", &url_save);
> +	pat_tok = strtok_r(pat, ".", &pat_save);

Hmph, this will be the first use of strtok_r() in our codebase.
Does everybody have it?

For a use like this where your delimiter set is a singleton, it may
be simpler to do the usual strchrnul() or memchr() based loop.  The
attached is my attempt to do so on top of this patch.

> +
> +	for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", &url_save),
> +				   pat_tok = strtok_r(NULL, ".", &pat_save)) {
> +		if (!strcmp(pat_tok, "*"))
> +			continue; /* a simple glob matches everything */

s/glob/asterisk/

Other than that, the patch looks OK.

diff --git a/urlmatch.c b/urlmatch.c
index 53ff972a60..8dfc7fd28a 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -63,36 +63,47 @@ static int append_normalized_escapes(struct strbuf *buf,
 	return 1;
 }
 
+static const char *end_of_token(const char *s, int c, size_t n)
+{
+	const char *next = memchr(s, c, n);
+	if (!next)
+		next = s + n;
+	return next;
+}
+
 static int match_host(const struct url_info *url_info,
 		      const struct url_info *pattern_info)
 {
-	char *url = xmemdupz(url_info->url + url_info->host_off, url_info->host_len);
-	char *pat = xmemdupz(pattern_info->url + pattern_info->host_off, pattern_info->host_len);
-	char *url_tok, *pat_tok, *url_save, *pat_save;
-	int matching;
-
-	url_tok = strtok_r(url, ".", &url_save);
-	pat_tok = strtok_r(pat, ".", &pat_save);
-
-	for (; url_tok && pat_tok; url_tok = strtok_r(NULL, ".", &url_save),
-				   pat_tok = strtok_r(NULL, ".", &pat_save)) {
-		if (!strcmp(pat_tok, "*"))
-			continue; /* a simple glob matches everything */
-
-		if (strcmp(url_tok, pat_tok)) {
-			/* subdomains do not match */
-			matching = 0;
-			break;
-		}
+	const char *url = url_info->url + url_info->host_off;
+	const char *pat = pattern_info->url + pattern_info->host_off;
+	int url_len = url_info->host_len;
+	int pat_len = pattern_info->host_len;
+
+	while (url_len && pat_len) {
+		const char *url_next = end_of_token(url, '.', url_len);
+		const char *pat_next = end_of_token(pat, '.', pat_len);
+
+		if (pat_next == pat + 1 && pat[0] == '*')
+			/* wildcard matches anything */
+			;
+		else if ((pat_next - pat) == (url_next - url) &&
+			 !memcmp(url, pat, url_next - url))
+			/* the components are the same */
+			;
+		else
+			return 0; /* found an unmatch */
+
+		if (url_next < url + url_len)
+			url_next++;
+		url_len -= url_next - url;
+		url = url_next;
+		if (pat_next < pat + pat_len)
+			pat_next++;
+		pat_len -= pat_next - pat;
+		pat = pat_next;
 	}
 
-	/* matching if both URL and pattern are at their ends */
-	matching = (url_tok == NULL && pat_tok == NULL);
-
-	free(url);
-	free(pat);
-
-	return matching;
+	return 1;
 }
 
 static char *url_normalize_1(const char *url, struct url_info *out_info, char allow_globs)