Re: [PATCH 0/9] [RFC] New sparse-checkout builtin and "cone" mode

On Thu, Aug 22, 2019 at 6:10 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 8/21/2019 5:52 PM, Elijah Newren wrote:
> > On Tue, Aug 20, 2019 at 8:12 AM Derrick Stolee via GitGitGadget
> > <gitgitgadget@xxxxxxxxx> wrote:

<snip>
> >> Here are some more specific details:
> >>
> >>  * git sparse-checkout init enables core.sparseCheckout and populates the
> >>    sparse-checkout file with patterns that match only the files at root.
> >
> > Does it enable core.sparseCheckout in the current worktree, or for all
> > worktrees?  Do we require extensions.worktreeConfig to be set to true
> > first?  If we don't require extensions.worktreeConfig to be set to
> > true, and users add worktrees later, do they encounter negative
> > surprises (immediately or later)?
>
> This is an interesting scenario that I had not considered. Thanks!
>
> My guess is that we should set `extensions.worktreeConfig=true` to
> avoid surprises. I'll need to play with this to discover the answers
> to these questions:
>
> 1. Where does the worktree look for the sparse-checkout file? Does
>    each worktree have its own sparse-checkout file? Should it?

For the main/first/primary worktree: .git/info/sparse-checkout
For all other worktrees: .git/worktrees/$WORKTREE/info/sparse-checkout

So, yes, each has its own, and from my viewpoint, absolutely yes that
is what we want.
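
To make the two locations above concrete, here is a minimal sketch that
computes the path for either case. The helper name and its arguments are
made up for illustration; nothing like it exists in Git, and it assumes
the common git directory is passed in explicitly.

```python
from pathlib import PurePosixPath
from typing import Optional


def sparse_checkout_path(common_git_dir: str, worktree: Optional[str] = None) -> str:
    """Return the sparse-checkout file location described above.

    `common_git_dir` is the repository's shared .git directory and
    `worktree` is the name of a linked worktree (None for the main one).
    Both parameters are hypothetical names used only in this sketch.
    """
    base = PurePosixPath(common_git_dir)
    if worktree is not None:
        # Linked worktrees keep their private files under worktrees/<name>/.
        base = base / "worktrees" / worktree
    return str(base / "info" / "sparse-checkout")
```

For example, sparse_checkout_path(".git", "feature") gives
.git/worktrees/feature/info/sparse-checkout.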

> 2. If I have `extensions.worktreeConfig=true` and `core.sparseCheckout=true`
>    in the current worktree and run `git worktree add`, does the new worktree
>    have `core.sparseCheckout=true`? Can we `git clone --sparse` and then
>    start building sparse worktrees seamlessly?

My $0.02: I think `git worktree add` should not only adopt the setting
of core.sparseCheckout from the current worktree, but it should also
adopt the $GIT_DIR/info/sparse-checkout file too.  Granted, users can
change it to something else, but much like a new shell starts up with
the same current working directory as its parent shell, I think it'd
be most obvious for people to have a worktree that looked similar to
the one they launched it from.

<snip>
> > The default of reading from stdin seems a bit unusual to me, and I
> > worry about having to explain that to users.  I'd rather the add
> > command took positional parameters (anything that doesn't start with a
> > hyphen) and added those, e.g.
> >   $ git sparse-checkout add '/myFolder/*' '
> > with the option of the user specifying --stdin.
>
> I had the same thought, and likely that's where we should go with the
> builtin. For our needs, the input over stdin is more important for
> testing, so I built it first. I will adjust the CLI here to take a set
> of paths over the arguments unless --stdin is given.
>
> >>  * git sparse-checkout disable removes the patterns from the sparse-checkout
> >>    file, disables core.sparseCheckout, and refills the working directory.
> >
> > Does it leave an empty sparse-checkout file around?  Also, what if
> > users have several paths defining their sparse pattern, and want to
> > temporarily get a full checkout and then come back -- do they need to
> > re-specify all the paths?  (Maybe this *is* the route we want to go;
> > I'm just trying to mention any possible negative effects we _might_
> > run into so we can consider them.  It's not quite as relevant in my
> > case since people specify a few toplevel modules and sparse-checkout
> > gets several entries auto-generated for them.)
>
> In this case, there is an intermediate step (that follows the existing
> advice) to modify the sparse-checkout file to contain only "/*\n" then
> run read-tree to fill the working directory before disabling the config
> setting.
>
> Perhaps "disable" is the wrong word to use, as it makes you think that
> there should be an "enable" that can quickly toggle between the two
> modes. Maybe instead it should be "git sparse-checkout reset [empty|full]"
> where you could 'reset' the sparse-checkout to one of two initial
> states:
>
> 1. empty: only files at root are included.
> 2. full: all files are included.
>
> In each case, we would obliterate the existing sparse-checkout entries,
> but hopefully that behavior is more clear from the command names.

Will "reset" be seen as slightly less obvious wording that needs to be
explained to users?  If so, maybe have "undo" and "empty" verbs?  (Of
course, "init" already empties, whether we start from a full checkout
or from some kind of sparse checkout.)  I dunno, just some ideas.
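
Whatever the verb ends up being, the two initial states quoted above are
easy to write down. This is only a sketch: the function name is
hypothetical, and the "empty" pattern pair is my assumption, taken from
the first two lines of the cone example later in this mail.

```python
from typing import List


def initial_patterns(state: str) -> List[str]:
    """Patterns for the two proposed initial states (illustrative only)."""
    if state == "empty":
        # Include everything at root, then exclude anything one level deeper.
        return ["/*", "!/*/*"]
    if state == "full":
        # A single pattern that matches every path.
        return ["/*"]
    raise ValueError(f"unknown state: {state!r}")
```

Either result would replace the existing entries, which is the
"obliterate" behavior described in the quoted text.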

> >> In cone mode, a user specifies a list of folders which the user wants every
> >> file inside. In addition, the cone adds all blobs that are siblings of the
> >> folders in the directory path to that folder. This makes the directories
> >> look "hydrated" as a user drills down to those recursively-closed folders.
> >> These directories are called "parent" folders, as a file matches them only
> >> if the file's immediate parent is that directory.
> >>
> >> When building a prototype of this feature, I used a separate file to contain
> >> the list of recursively-closed folders and built the hashsets dynamically
> >> based on that file. In this implementation, I tried to maximize the amount
> >> of backwards-compatibility by storing all data in the sparse-checkout file
> >> using patterns recognized by earlier Git versions.
> >>
> >> For example, if we add A/B/C as a recursive folder, then we add the
> >> following patterns to the sparse-checkout file:
> >>
> >> /*
> >> !/*/*
> >> /A/*
> >> !/A/*/*
> >> /A/B/*
> >> !/A/B/*/*
> >> /A/B/C/*
> >>
> >> The alternating positive/negative patterns say "include everything in this
> >> folder, but exclude everything another level deeper". The final pattern has
> >> no matching negation, so is a recursively closed pattern.
> >
> > Oh, um, would there be any option for fast but without grabbing
> > sibling and parent files of requested directories?  And could users
> > still request individual files (not with regex or pathspec, but fully
> > specifying the path) and still get the fast mode?
>
> Exact files could probably be included and still be fast. It requires an
> extra hash check per entry, but that's a small price to pay I think.
>
> With the sibling files, this is something I believe to be user-friendly:
> as a user drills down into the folder they included recursively, there may
> be helpful files along the way, like documentation, project files, etc.
>
> Here is my philosophical position here: a repo can take advantage of the
> sparse-checkout feature if it is properly componentized. Those component
> boundaries are likely at folder boundaries. Any file that exists in a parent
> folder for two components is likely important to _both_ components. If
> a file is large and is not needed by both components, it should be placed
> deeper in the tree, so it can be avoided.
>
> With that philosophy in mind, I designed this to help users fall into the
> "pit of success" when their repo is in a good shape AND to motivate users
> with repos in non-optimal shapes to reorganize.
>
> The thought I had about exact file names is similar: if there is a large
> list of files in a folder where I only need a subset, then how do I know
> if a new file is added that I need? It will not show up in the directory
> without updating the sparse-checkout. A user would discover this need by
> something going wrong when they are not interacting with version control:
> a build.
>
> This is particularly important with the root directory. We need things
> like .gitignore, .gitattributes, README, LICENSE, etc. to be populated
> by default. If there are too many files at root to reasonably work with
> the repo, then the repo should be reorganized using folders.
>
> > Basically, our sparse usage is exclusively specifying leading
> > directories or full pathnames of individual files, but we really want
> > the repo to feel smaller and make sure people notice at a glance.  We
> > have a huge 'modules/' directory, and want people to be able to get
> > just 15 of the 500 or so subdirectories that would appear in that
> > directory with a non-sparse checkout.  And similarly we want to be
> > able to grab just one or two files from a directory of many files.
>
> Your modules/ example seems to work with the feature as designed, as
> you want a set of folders one level deeper. Grabbing one or two files
> from a directory is a direction we can go with the feature, but I will
> continue to believe that should be a rare occurrence compared to including
> a folder recursively.

Oh, you're right, I was misunderstanding what it'd do.  This does look
like it's really close to what we're using, and most of the
differences are probably worth some slight reshuffling of paths in
the repo.  Now that I've played with it some, it seems really awesome.
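
For anyone else who misread it the way I did, here is a rough sketch of
how the pattern list quoted above (the A/B/C example) could be generated
from a list of recursive folders. This is not the code from the series;
the function name is invented, and it assumes the requested folders are
not nested inside one another.

```python
from typing import Iterable, List


def cone_patterns(recursive_folders: Iterable[str]) -> List[str]:
    """Build sparse-checkout patterns in the style of the A/B/C example."""
    recursive = sorted({folder.strip("/") for folder in recursive_folders})

    # Every proper ancestor of a recursive folder is a "parent" folder:
    # only the files directly inside it are included.
    parents = set()
    for folder in recursive:
        parts = folder.split("/")
        for depth in range(1, len(parts)):
            parents.add("/".join(parts[:depth]))

    # Assumption of this sketch: no recursive folder contains another one.
    if parents & set(recursive):
        raise ValueError("nested recursive folders are not handled here")

    patterns = ["/*", "!/*/*"]  # files at root, nothing deeper
    for parent in sorted(parents):
        patterns.append(f"/{parent}/*")     # everything in this folder...
        patterns.append(f"!/{parent}/*/*")  # ...but nothing a level deeper
    for folder in recursive:
        patterns.append(f"/{folder}/*")     # no negation: recursively closed
    return patterns
```

Sorting puts each parent before its children, so a child's positive
pattern always comes after the parent's negation that would otherwise
exclude it.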

Being able to grab one or two files from a directory without grabbing
an entire directory and its parents I think would probably still be
useful, but I do agree that it'd be a rare occurrence.
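
As I understand the hashset idea, matching a path would look roughly
like the sketch below. The names are mine, not from the patches, and the
exact_files set is the hypothetical "extra hash check per entry"
mentioned in the quoted text rather than anything the series implements.

```python
import posixpath
from typing import AbstractSet


def in_cone(
    path: str,
    recursive: AbstractSet[str],
    parents: AbstractSet[str],
    exact_files: AbstractSet[str] = frozenset(),
) -> bool:
    """Decide whether `path` is in the sparse cone using set lookups only."""
    if path in exact_files:
        # Hypothetical extension: individually requested files.
        return True

    directory = posixpath.dirname(path)
    # Files at root, or directly inside a parent folder, are included.
    if directory == "" or directory in parents:
        return True

    # Otherwise the file needs some ancestor that is recursively included.
    while directory:
        if directory in recursive:
            return True
        directory = posixpath.dirname(directory)
    return False
```

With recursive = {"A/B/C"} and parents = {"A", "A/B"}, a path such as
A/X/other.c is excluded while A/notes.txt and anything under A/B/C is
included.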

> >> Note that I have some basic warnings to try and check that the
> >> sparse-checkout file doesn't match what would be written by a cone-mode add.
> >> In such a case, Git writes a warning to stderr and continues with the old
> >> pattern matching algorithm. These checks are currently very barebones, and
> >> would need to be updated with more robust checks for things like regex
> >> characters in the middle of the pattern. As review moves forward (and if we
> >> don't change the data storage) then we could spend more time on this.
> >
> > Instead of trying to validate the sparse-checkout file every time,
> > perhaps we want to change core.sparseCheckout from a boolean to a
> > tri-state or something where it specifies how to parse the
> > sparse-checkout file?  Or maybe when a special directive (some form of
> > comment-looking line) appears at the top of sparse-checkout then we
> > use the hashsets speedup while disallowing general regexes and
> > pathspecs other than leading directories and full pathnames?
>
> In this series, I turn `core.sparseCheckout` into a tri-state, and only
> try to validate the sparse-checkout when `core.sparseCheckout=cone`.
> This avoids spending time on the validation when someone is content using
> the existing feature.
>
> The _intent_ of using the sparse-checkout file and no extra data structure
> was to let other clients (or an older client) read the sparse-checkout data
> and result in the same working directory. One thing I realized after
> submitting is that the tri-state config variable will cause old clients
> to error on parsing the non-boolean value. Instead, in v2 I will introduce
> a new boolean config variable "core.sparseCheckoutCone" that will do the
> same thing as the current series when `core.sparseCheckout=cone` and will
> fix this compat scenario.

Once we are forced to use yet another config variable, we may as well
use yet another config file ($GIT_DIR/info/sparse-checkout-cone or
something; or maybe a less specific name with greater future
compatibility via some version marking in it).
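
The compat problem in the quoted text is easy to see with a simplified
model of how a boolean config value is parsed. This is an approximation
of the documented rules (true/yes/on, false/no/off, integers, empty
string), not Git's actual code, and the function name is made up.

```python
from typing import Optional


def parse_git_bool(value: Optional[str]) -> bool:
    """Simplified model of boolean config parsing (an approximation)."""
    if value is None:
        # A key given with no value at all counts as true.
        return True
    lowered = value.strip().lower()
    if lowered in ("true", "yes", "on"):
        return True
    if lowered in ("false", "no", "off", ""):
        return False
    try:
        # Plain integers: zero is false, anything else is true.
        return int(lowered) != 0
    except ValueError:
        raise ValueError(f"bad boolean config value '{value}'") from None
```

An older client modeled this way accepts "true" but raises on "cone",
which is why a separate boolean such as core.sparseCheckoutCone keeps
core.sparseCheckout readable by old versions.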

One thing I noticed twice while using this series was that when I had
an existing sparse checkout it was easy to get into a weird state
where things were messed up, I think due to the fact that
"sparse-checkout init [--cone]" prefers to honor any pre-existing
$GIT_DIR/info/sparse-checkout file.  Once my config file was very much
not cone-compatible, and another time it was empty and caused
read-tree to error out with something like "there'd be nothing left!".
I manually twiddled with core.sparseCheckout and the sparse-checkout
file and 'git read-tree -mu HEAD' to get it fixed, but I'd rather
avoid others running into such problems.  Sorry I didn't take good
notes on it; I was just trying to get a good feel for this series.
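
Something like the following check, run by "init" before it decides to
honor a pre-existing file, might have caught both of my cases. It is a
deliberately barebones sketch under my own assumptions: the function
name is invented, and it only accepts the exact pattern shapes from the
A/B/C example.

```python
from typing import Iterable

# Characters that would make a folder name behave like a glob or negation.
_SPECIAL = set("*?[]\\!")


def looks_cone_compatible(lines: Iterable[str]) -> bool:
    """Return True if the patterns look like cone-mode output."""
    patterns = [line.strip() for line in lines]
    patterns = [p for p in patterns if p and not p.startswith("#")]

    # An empty file is the "there'd be nothing left" case.
    if patterns[:2] != ["/*", "!/*/*"]:
        return False

    for pattern in patterns[2:]:
        negated = pattern.startswith("!")
        body = pattern[1:] if negated else pattern
        suffix = "/*/*" if negated else "/*"
        if not body.startswith("/") or not body.endswith(suffix):
            return False
        folder = body[1 : len(body) - len(suffix)]
        # The folder part must be a plain, non-empty path.
        if not folder or any(ch in _SPECIAL for ch in folder):
            return False
    return True
```

When this returns False, the command could warn and either fall back to
the old matching or refuse to reuse the file, whichever the series
settles on.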

> > I'll try to get some time to look over these patches in the next few days.
>
> I look forward to your feedback! I also have some feedback to respond to
> from my team [1], but I'm waiting to make sure the community likes the
> overall idea before jumping into code style and method organization
> details.

I think this idea is great; I'm a big fan right now.  I'm excited to
see how this will pan out.


