Re: Sparse checkout inconsistency with non-folder paths between cone mode and full matching (2.5.0)

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 1/30/2020 10:25 AM, Finn Bryant wrote:
> Hi,

Hello! Thanks for chiming in with feedback.

> With cone mode enabled for a sparse checkout, a pattern like the
> following is accepted:
> 
> /*
> !/*/
> /a_file_or_folder/

As you mention below, you say this path is a file OR a directory.
However, the trailing slash means _only_ match directories with this
name. This is a specific choice made in creating cone mode to restrict
only by directories.

> My suspicion is that cone mode is supposed to be a pure subset of full
> pattern matching, such that if cone mode is ever disabled, the
> sparseness of the worktree will be unchanged. Clearly, this scenario
> is breaking that pattern.

By "pure subset" you must mean "strict subset". Cone mode is definitely
a smaller set of patterns.

> I think the correct behaviour is that recursive matches for a
> potential directory do not have any effect on a non-directory file
> with the same name. Alternatively, you could forbid any patterns which
> match non-directory files instead (and downgrade to full pattern
> matching), though I'd not be a fan of this, since it'd mean the
> validity of a cone-mode sparse configuration file is dependent on the
> contents of the repo, and is thus much harder to ascertain (scripts
> can't simply prove if it's a valid cone mode file by parsing it,
> they'd need to have access to any repo it may be applied to, and
> inspect the type of any matching file/folder paths, and its validity
> could be changed simply by replacing a folder with a file in the
> repo).
> 
> If matching behaviour with full pattern mode is a non-goal for cone
> mode, I'd still question the logic of this behaviour, though I suppose
> it does have the benefit of (accidentally?) adding support for
> excluding individual files from a sparse checkout, which I imagine
> some could find useful. Personally I'd prefer that was instead added
> with a more sane syntax, if needed.

Cone mode is specifically designed for performance using a hashset-based
pattern matching. This naturally restricts the possible patterns since
we cannot match the full pattern set [1] this quickly. This means there
are some trade-offs that are required, but they were made with some
information of real-world repositories organized in a way that they
could take advantage of this pattern set.

[1] https://git-scm.com/docs/gitignore#_pattern_format
 
> $ git sparse-checkout init --cone
> $ git read-tree -mu HEAD
> $ ls -1
> a_file_or_folder
> some_file
> $ git sparse-checkout set a_file_or_folder
> $ git read-tree -mu HEAD
> $ ls -1
> some_file
> $ cat .git/info/sparse-checkout
> /*
> !/*/
> /a_file_or_folder/

This is an interesting test, because I would expect the /a_file_or_folder/
pattern to not cause the _file_ to not match. It does match the first two
patterns, and just because it doesn't match the third pattern shouldn't
remove it.

This is actually a bug in the hashset-based pattern-matching algorithm,
since setting core.sparseCheckoutCone=false does not have this behavior.
I'll make a patch to fix this.

> Right now I'm trying to integrate cone mode with my company's existing
> manifest infrastructure, which doesn't differentiate between files and
> folders, so this is forcing me to perform a lot of repo checks to
> confirm we aren't accidentally excluding files we were supposed to
> include. Just in case you needed another example of why this behaviour
> is unhelpful.

I apologize that this wrong behavior is causing you some issues. If you
are able to identify which paths are files instead of directories, then
you can remove the filename from the path to include its parent directory.
This is only a workaround until the bug is fixed, but the end-result will
be the same: you'll have all sibling files in your working directory as
well as the files you specified.

This is important: if you have a directory with many large files, but
intend to include only a subset of those files, then you will be
disappointed by the size of your working directory.

This requirement of specifying directories instead of files is part
of the cone mode design for two reasons: one philosophical and another
more practical:

Philisophical: Filenames change more often than directories.

That is, as users are interacting with your repo, the overall directory
structure at a high level is typically static. If new cross-component
dependencies are introduced, then those are usually big architecture-level
changes. However, at an individual file level dependencies change at
a higher rate, causing users to react by changing their sparse-checkout
definition more frequently.

Practical: How do we specify if an input is a directory or a file?

As you mentioned, if you run "git sparse-checkout set X" then you get
a sparse-checkout file containing:

	/*
	!/*/
	/X/

But if X is actually a _file_, then we should write this pattern:

	/*
	!/*/
	/X

So the input does not provide enough information to say which pattern
should be written.

The other concern about this second set of patterns is that "/X" is
only a _prefix_ match, not an exact path name match. That means we
cannot use the existing hashset matching to result in the same
pattern matching as non-cone-mode.

Now, there is perhaps a way to decide between the two: inspect the
current index or tree for whether the input corresponds to a directory
or a file. This adds a layer of complexity that is not currently in
the sparse-checkout builtin, but it is not insurmountable. The only
thing to consider is what happens when the input path is not in the
index/tree at all! It may be that we want to specify the patterns
before moving HEAD to a commit that _does_ contain the path, and then
we would do the wrong thing at this point.

So I'll put out a question to the community: is this practical issue
something worth overcoming? Is my concern about a non-existent path
a true problem, or something more?

Thanks,
-Stolee





[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux