On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget <gitgitgadget@xxxxxxxxx> wrote: > > This RFC includes a potential direction to make the sparse-checkout more > user-friendly. While there, I also present a way to use a limited set of > patterns to gain a significant performance boost in very large repositories. > > Sparse-checkout is only documented as a subsection of the read-tree docs > [1], which makes the feature hard to discover. Users have trouble navigating > the feature, especially at clone time [2], and have even resorted to > creating their own helper tools [3]. Ooh, intriguing. Count me as another person who has resorted to making my own helper tool for others to use (specific to our internal repository, though, as it also figures out inter-module dependencies to allow specifying only a few modules of interest while still checking out everything needed to build those; but it'd be nice to need less scripting to handle the git-related bits to actually sparsify or densify). > This RFC attempts to solve these problems using a new builtin. Here is a > sample workflow to give a feeling for how it can work: > > In an existing repo: > > $ git sparse-checkout init > $ ls > myFile1.txt myFile2.txt > > $ git sparse-checkout add > /myFolder/* > ^D > $ ls > myFile1.txt myFile2.txt myFolder > $ ls myFolder > a.c a.h > $ git sparse-checkout disable > $ ls > hiddenFolder myFile1.txt myFile2.txt myFolder > > At clone time: > > $ git clone --sparse origin repo > $ cd repo > $ ls > myFile1.txt myFile2.txt > $ git sparse-checkout add > /myFolder/* > ^D > $ ls > myFile1.txt myFile2.txt myFolder > > Here are some more specific details: > > * git sparse-checkout init enables core.sparseCheckout and populates the > sparse-checkout file with patterns that match only the files at root. Does it enable core.sparseCheckout in the current worktree, or for all worktrees? Do we require extensions.worktreeConfig to be set to true first? If we don't require extensions.worktreeConfig to be set to true, and users add worktrees later, do they encounter negative surprises (immediately or later)? worktrees in combination with sparseCheckouts were a headache here until I just forced people to manually first set extensions.worktreeConfig to true before using my 'sparsify' script, regardless of whether the user was currently using worktrees. That fixed the issues, but having to provide a long error message and explanation of why I wanted users to set some special config first was slightly annoying. I wonder if 'git worktree' and maybe even 'git config' should themselves have special handling for core.sparseCheckouts, because it can be a real mess otherwise. > * git clone learns the --sparse argument to run git sparse-checkout init > before the first checkout. Nice. > * git sparse-checkout add reads patterns from stdin, one per line, then > adds them to the sparse-checkout file and refreshes the working > directory. The default of reading from stdin seems a bit unusual to me, and I worry about having to explain that to users. I'd rather the add command took positional parameters (anything that doesn't start with a hyphen) and added those, e.g. $ git sparse-checkout add '/myFolder/*' ' with the option of the user specifying --stdin. > * git sparse-checkout disable removes the patterns from the sparse-checkout > file, disables core.sparseCheckout, and refills the working directory. Does it leave an empty sparse-checkout file around? Also, what if users have several paths defining their sparse pattern, and want to temporarily get a full checkout and then come back -- do they need to re-specify all the paths? (Maybe this *is* the route we want to go; I'm just trying to mention any possible negative effects we _might_ run into so we can consider them. It's not quite as relevant in my case since people specify a few toplevel modules and sparse-checkout gets several entries auto-generated for them.) Also, I'm particularly worried that a user with multiple worktrees, both sparse, could run 'git sparse-checkout disable' in one and then find that when they return to the other worktree they get a variety of nasty surprises (e.g. accidental staging or committing of the deletion of a huge number of files, random weird errors, or gradual and weird de-sparsifying as various git commands are run). This, of course, can be averted by making sure core.sparseCheckout is set on a per-worktree basis, but that seems to be something people only do after running into problems several times unless some kind of tooling enforces it. > * git sparse-checkout list lists the contents of the sparse-checkout file. > > > > The documentation for the sparse-checkout feature can now live primarily > with the git-sparse-checkout documentation. Yaay! > Cone Mode > ========= > > What really got me interested in this area is a performance problem. If we > have N patterns in the sparse-checkout file and M entries in the index, then > we can perform up to O(N * M) pattern checks in clear_ce_flags(). This > quadratic growth is not sustainable in a repo with 1,000+ patterns and > 1,000,000+ index entries. This has worried me for a while, even if it hasn't yet caused us issues in practice. > To solve this problem, I propose a new, more restrictive mode to > sparse-checkout: "cone mode". In this mode, all patterns are based on prefix > matches at a directory level. This can then use hashsets for fast > performance -- O(M) instead of O(N*M). My hashset implementation is based on > the virtual filesystem hook in the VFS for Git custom code [4]. Sweet! > In cone mode, a user specifies a list of folders which the user wants every > file inside. In addition, the cone adds all blobs that are siblings of the > folders in the directory path to that folder. This makes the directories > look "hydrated" as a user drills down to those recursively-closed folders. > These directories are called "parent" folders, as a file matches them only > if the file's immediate parent is that directory. > > When building a prototype of this feature, I used a separate file to contain > the list of recursively-closed folders and built the hashsets dynamically > based on that file. In this implementation, I tried to maximize the amount > of backwards-compatibility by storing all data in the sparse-checkout file > using patterns recognized by earlier Git versions. > > For example, if we add A/B/C as a recursive folder, then we add the > following patterns to the sparse-checkout file: > > /* > !/*/* > /A/* > !/A/*/* > /A/B/* > !/A/B/*/* > /A/B/C/* > > The alternating positive/negative patterns say "include everything in this > folder, but exclude everything another level deeper". The final pattern has > no matching negation, so is a recursively closed pattern. Oh, um, would there be any option for fast but without grabbing sibling and parent files of requested directories? And could users still request individual files (not with regex or pathspec, but fully specifying the path) and still get the fast mode? Basically, our sparse usage is exclusively specifying leading directories or full pathnames of individual files, but we really want the repo to feel smaller and make sure people notice at a glance. We have a huge 'modules/' directory, and want people to be able to get just 15 of the 500 or so subdirectories that would appear in that directory with a non-sparse checkout. And similarly we want to be able to grab just one or two files from a directory of many files. > Note that I have some basic warnings to try and check that the > sparse-checkout file doesn't match what would be written by a cone-mode add. > In such a case, Git writes a warning to stderr and continues with the old > pattern matching algorithm. These checks are currently very barebones, and > would need to be updated with more robust checks for things like regex > characters in the middle of the pattern. As review moves forward (and if we > don't change the data storage) then we could spend more time on this. Instead of trying to validate the sparse-checkout file everytime, perhaps we want to change core.sparseCheckout from a boolean to a tri-state or something where it specifies how to parse the sparse-checkout file? Or maybe when special directive (some form of comment-looking line) appears at the top of sparse-checkout then we use the hashsets speedup while disallowing general regexes and pathspecs other than leading directories and full pathnames? I'm not sure if that makes sense or not; just throwing it out there as an idea. > Derrick Stolee (8): > sparse-checkout: create builtin with 'list' subcommand > sparse-checkout: create 'init' subcommand > clone: add --sparse mode > sparse-checkout: 'add' subcommand > sparse-checkout: create 'disable' subcommand > sparse-checkout: add 'cone' mode > sparse-checkout: use hashmaps for cone patterns > sparse-checkout: init and add in cone mode > > Jeff Hostetler (1): > trace2:experiment: clear_ce_flags_1 I'll try to get some time to look over these patches in the next few days.