Re: [PATCH v3 10/12] sparse-checkout: write escaped patterns in cone mode

Jeff King <peff@xxxxxxxx> · Wed, 29 Jan 2020 05:17:13 -0500

On Tue, Jan 28, 2020 at 06:26:41PM +0000, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> 
> If a user somehow creates a directory with an asterisk (*) or backslash
> (\), then the "git sparse-checkout set" command will struggle to provide
> the correct pattern in the sparse-checkout file. When not in cone mode,
> the provided pattern is written directly into the sparse-checkout file.
> However, in cone mode we expect a list of paths to directories and then
> we convert those into patterns.

Is this really about cone mode? It seems more like it is about --stdin.
I.e., what are the rules for when the input is a filename and when it is
a pattern? In our earlier discussion, I had assumed that command-line
arguments to "sparse-checkout set" were actual filenames, and "--stdin"
just read them from stdin.

But looking at the documentation, they are always called "patterns" on
the command-line. Should the "--stdin" documentation make it clear that
we are no longer taking patterns, but instead actual filenames?

Or am I confused, and in non-cone-mode the "ls-tree | sparse-checkout"
pipeline is not supposed to work at all? (I.e., they really are always
patterns)?

> Even more specifically, the goal is to always allow the following from
> the root of a repo:
> 
>   git ls-tree --name-only -d HEAD | git sparse-checkout set --stdin
> 
> The ls-tree command provides directory names with an unescaped asterisk.
> It also quotes the directories that contain an escaped backslash. We
> must remove these quotes, then keep the escaped backslashes.
> 
> However, there is some care needed for the timing of these escapes. The
> in-memory pattern list is used to update the working directory before
> writing the patterns to disk. Thus, we need the command to have the
> unescaped names in the hashsets for the cone comparisons, then escape
> the patterns later.

OK, this part make sense.

> Use unquote_c_style() when parsing lines from stdin. Command-line
> arguments will be parsed as-is, assuming the user can do the correct
> level of escaping from their environment to match the exact directory
> names.

I think there's two issues here: escaping characters from the shell so
that they make it intact to Git, and the question of whether Git is
expecting patterns or raw filenames. I agree the user is responsible for
the shell half, but I think we need to clarify what we're expecting.
I.e., if I say:

 git sparse-checkout set 'f*'

am I trying to match "foo", or the literal file "f*"?

> +static char *escaped_pattern(char *pattern)
> +{
> +	char *p = pattern;
> +	struct strbuf final = STRBUF_INIT;
> +
> +	while (*p) {
> +		if (*p == '*' || *p == '\\')
> +			strbuf_addch(&final, '\\');
> +
> +		strbuf_addch(&final, *p);
> +		p++;
> +	}
> +
> +	return strbuf_detach(&final, NULL);
> +}

Do we need to catch other metacharacters here (using is_glob_special()
perhaps)?

> @@ -423,8 +442,21 @@ static int sparse_checkout_set(int argc, const char **argv, const char *prefix)
>  		pl.use_cone_patterns = 1;
>  
>  		if (set_opts.use_stdin) {
> -			while (!strbuf_getline(&line, stdin))
> +			struct strbuf unquoted = STRBUF_INIT;
> +			while (!strbuf_getline(&line, stdin)) {
> +				if (line.buf[0] == '"') {
> +					strbuf_setlen(&unquoted, 0);

A minor nit, but strbuf_reset(&unquoted) would be more idiomatic here.

> +					if (unquote_c_style(&unquoted, line.buf, NULL))
> +						die(_("unable to unquote C-style string '%s'"),
> +						line.buf);
> +
> +					strbuf_swap(&unquoted, &line);
> +				}
> +
>  				strbuf_to_cone_pattern(&line, &pl);
> +			}

OK, overall this input procedure makes sense to me.

-Peff