On 1/30/2020 10:25 AM, Finn Bryant wrote: > Hi, Hello! Thanks for chiming in with feedback. > With cone mode enabled for a sparse checkout, a pattern like the > following is accepted: > > /* > !/*/ > /a_file_or_folder/ As you mention below, you say this path is a file OR a directory. However, the trailing slash means _only_ match directories with this name. This is a specific choice made in creating cone mode to restrict only by directories. > My suspicion is that cone mode is supposed to be a pure subset of full > pattern matching, such that if cone mode is ever disabled, the > sparseness of the worktree will be unchanged. Clearly, this scenario > is breaking that pattern. By "pure subset" you must mean "strict subset". Cone mode is definitely a smaller set of patterns. > I think the correct behaviour is that recursive matches for a > potential directory do not have any effect on a non-directory file > with the same name. Alternatively, you could forbid any patterns which > match non-directory files instead (and downgrade to full pattern > matching), though I'd not be a fan of this, since it'd mean the > validity of a cone-mode sparse configuration file is dependent on the > contents of the repo, and is thus much harder to ascertain (scripts > can't simply prove if it's a valid cone mode file by parsing it, > they'd need to have access to any repo it may be applied to, and > inspect the type of any matching file/folder paths, and its validity > could be changed simply by replacing a folder with a file in the > repo). > > If matching behaviour with full pattern mode is a non-goal for cone > mode, I'd still question the logic of this behaviour, though I suppose > it does have the benefit of (accidentally?) adding support for > excluding individual files from a sparse checkout, which I imagine > some could find useful. Personally I'd prefer that was instead added > with a more sane syntax, if needed. Cone mode is specifically designed for performance using a hashset-based pattern matching. This naturally restricts the possible patterns since we cannot match the full pattern set [1] this quickly. This means there are some trade-offs that are required, but they were made with some information of real-world repositories organized in a way that they could take advantage of this pattern set. [1] https://git-scm.com/docs/gitignore#_pattern_format > $ git sparse-checkout init --cone > $ git read-tree -mu HEAD > $ ls -1 > a_file_or_folder > some_file > $ git sparse-checkout set a_file_or_folder > $ git read-tree -mu HEAD > $ ls -1 > some_file > $ cat .git/info/sparse-checkout > /* > !/*/ > /a_file_or_folder/ This is an interesting test, because I would expect the /a_file_or_folder/ pattern to not cause the _file_ to not match. It does match the first two patterns, and just because it doesn't match the third pattern shouldn't remove it. This is actually a bug in the hashset-based pattern-matching algorithm, since setting core.sparseCheckoutCone=false does not have this behavior. I'll make a patch to fix this. > Right now I'm trying to integrate cone mode with my company's existing > manifest infrastructure, which doesn't differentiate between files and > folders, so this is forcing me to perform a lot of repo checks to > confirm we aren't accidentally excluding files we were supposed to > include. Just in case you needed another example of why this behaviour > is unhelpful. I apologize that this wrong behavior is causing you some issues. If you are able to identify which paths are files instead of directories, then you can remove the filename from the path to include its parent directory. This is only a workaround until the bug is fixed, but the end-result will be the same: you'll have all sibling files in your working directory as well as the files you specified. This is important: if you have a directory with many large files, but intend to include only a subset of those files, then you will be disappointed by the size of your working directory. This requirement of specifying directories instead of files is part of the cone mode design for two reasons: one philosophical and another more practical: Philisophical: Filenames change more often than directories. That is, as users are interacting with your repo, the overall directory structure at a high level is typically static. If new cross-component dependencies are introduced, then those are usually big architecture-level changes. However, at an individual file level dependencies change at a higher rate, causing users to react by changing their sparse-checkout definition more frequently. Practical: How do we specify if an input is a directory or a file? As you mentioned, if you run "git sparse-checkout set X" then you get a sparse-checkout file containing: /* !/*/ /X/ But if X is actually a _file_, then we should write this pattern: /* !/*/ /X So the input does not provide enough information to say which pattern should be written. The other concern about this second set of patterns is that "/X" is only a _prefix_ match, not an exact path name match. That means we cannot use the existing hashset matching to result in the same pattern matching as non-cone-mode. Now, there is perhaps a way to decide between the two: inspect the current index or tree for whether the input corresponds to a directory or a file. This adds a layer of complexity that is not currently in the sparse-checkout builtin, but it is not insurmountable. The only thing to consider is what happens when the input path is not in the index/tree at all! It may be that we want to specify the patterns before moving HEAD to a commit that _does_ contain the path, and then we would do the wrong thing at this point. So I'll put out a question to the community: is this practical issue something worth overcoming? Is my concern about a non-existent path a true problem, or something more? Thanks, -Stolee