Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions

Elijah Newren <newren@xxxxxxxxx> · Tue, 27 Sep 2022 22:38:14 -0700

On Tue, Sep 27, 2022 at 9:36 AM Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote:
>
> On 9/24/2022 8:09 PM, Elijah Newren via GitGitGadget wrote:
> > From: Elijah Newren <newren@xxxxxxxxx>
>
> > +  (Behavior A) Users are _only_ interested in the sparse portion of the repo
> > +
> > +These folks might know there are other things in the repository, but
> > +don't care.  They are uninterested in other parts of the repository, and
> > +only want to know about changes within their area of interest.  Showing
> > +them other results from history (e.g. from diff/log/grep/etc.) is a
> > +usability annoyance, potentially a huge one since other changes in
> > +history may dwarf the changes they are interested in.
>
> This idea of restricting the commit history to the sparse-checkout
> definition (by default, with an escape hatch) seems like the most
> radical of the things we've considered. I think it's interesting to
> consider, but it might be better to think about things like diffstats,
> grepping, and otherwise preventing out-of-cone adjustments by default.
>
> That said, the idea of restricting history is also the simplest to
> describe as a user-visible change.

By "restricting commit history", are you thinking in terms of "git log
-- PATHS" or more like some kind of special --filter to git-clone?

I get the feeling you might be thinking about the latter, whereas I
was assuming users had all commits (and all trees), but log/diff would
restrict output based on relevant paths.

> > +Some of these users also arrive at this usecase from wanting to use
> > +partial clones together with sparse checkouts and do disconnected
> > +development.  Not only do these users generally not care about other
> > +parts of the repository, but consider it a blocker for Git commands to
> > +try to operate on those.  If commands attempt to access paths in history
> > +outside the sparsity specification, then the partial clone will attempt
> > +to download additional blobs on demand, fail, and then fail the user's
> > +command.  (This may be unavoidable in some cases, e.g. when `git merge`
> > +has non-trivial changes to reconcile outside the sparsity path, but we
> > +should limit how often users are forced to connect to the network.)
>
> This idea pairs well with a feature I've been meaning to build:
> 'git sparse-checkout backfill' would download all historical blobs
> within the sparse-checkout definition. This is possible with rev-list,
> but I want to investigate grouping blobs by path and making requests in
> batches, hopefully allowing better deltification and ability to recover
> from network disconnections. That makes this idea of "staying within
> your sparse-checkout means no missing object downloads" even more likely.

This sounds awesome.

> > +  (Behavior B) Users want a sparse working tree, but are working in a larger whole
> > +
> > +Stolee described this usecase this way[11]:
> > +
> > +"I'm also focused on users that know that they are a part of a larger
> > +whole. They know they are operating on a large repository but focus on
> > +what they need to contribute their part. I expect multiple "roles" to
> > +use very different, almost disjoint parts of the codebase. Some other
> > +"architect" users operate across the entire tree or hop between different
> > +sections of the codebase as necessary. In this situation, I'm wary of
> > +scoping too many features to the sparse-checkout definition, especially
> > +"git log," as it can be too confusing to have their view of the codebase
> > +depend on your "point of view."
>
> Thanks for including this.

I was actually worried this usecase was decreasing in priority for
you.  More on that later...

> > +People might also end up wanting behavior B due to complex inter-project
> > +dependencies.  The initial attempts to use sparse-checkouts usually
> > +involve the directories you are directly interested in plus what those
> > +directories depend upon within your repository.  But there's a monkey
> > +wrench here: if you have integration tests, they invert the hierarchy:
> > +to run integration tests, you need not only what you are interested in
> > +and its dependencies, you also need everything that depends upon what
> > +you are interested in or that depends upon one of your
> > +dependencies...AND you need all the dependencies of that expanded group.
> > +That can easily change your sparse-checkout into a nearly dense one.
>
> In my experience, the downstream dependencies are checked via builds in
> the cloud, though that doesn't help if they are source dependencies and
> you make a breaking change to an API interface. This kind of problem is
> absolutely one of system architecture and I don't know what Git can do
> other than to acknowledge it and recommend good patterns.

I was talking about (source) dependencies between
modules/projects/whatever-you-want-to-call-the-subcomponents of your
repository.  We have hundreds of modules, with various cross-module
dependencies that evolve over time.

I get the feeling from your description that your intra-repository
dependencies between modules/projects/whatever are much more static
for you than what we deal with.  (Which is a good thing; it'd be nice
if ours were more static.)

> In a properly-organized project, 95% of engineers in the project can have
> a small sparse-checkout, then 5% work on the common core that has these
> downstream dependencies and require a large sparse-checkout definition.

"In a properly-organized project"?  I'm unsure if this is an
indictment of some of the repositories I deal with in reality (and to
be fair, it might be a totally fair indictment), or if your statement
is starting to cross into "No true scotsman" territory.  ;-)

I would probably lean towards the former (we know it's more messy than
it should be), but I'm a bit puzzled that you'd just brush aside my
mention of integration tests.  We have people who want to run
integration tests locally, even when only modifying a small area of
the codebase.  These users are not doing cross-tree work, rather they
are doing cross-tree testing in conjunction with their work.  Running
such tests requires a build of the modules across the repository,
which naively would push folks into a dense checkout...and really long
local builds.  We want fast local builds, and sparse-checkouts help us
achieve that...but it does mean we have to be clever about how we
build in order to let these users run integration tests.  (And we have
to make it easy for users to discover the relevant integration tests,
and sometimes associated code components that depend on what they are
changing, which is where behavior B comes in).

> There's nothing Git can do to help those engineers that do cross-tree
> work.

I'm going to partially disagree with this, in part because of our
experience with many inter-module dependencies that evolve over time.
Folks can start on a certain module and begin refactoring.  Being
aware that their changes will affect other areas of the code, the can
do a search (e.g. "git grep --cached ..." to find cases outside their
current sparse checkout), and then selectively unsparsify to get the
relevant few dozen (or maybe even few hundred) modules added.  They
aren't switching to a dense checkout, just a less sparse one.  When
they are done, they may narrow their sparse specification again.  We
have a number of users doing cross-tree work who are using
sparse-checkouts, and who find it productive and say it still speeds
up their local build/test cycles.

So, I'd say that ensuring Git supports behavior B well in
sparse-checkouts, is something Git can do to help out both some of the
engineers doing cross-tree work, and some of the engineers that are
doing cross-tree testing.

(For full disclosure, we also have users doing cross-tree work using
regular dense checkouts and I agree there's not a lot we can do to
help them.)

> (nit: this is a good place to break up this paragraph.)

Yeah, it was kind of nice to have one paragraph per explanation of why
people might like behavior B.  But this is indeed a long paragraph.

[...]
> > +      * blame
> > +     * and annotate
>
> blame and annotate operate on a single path, so they already
> restrict within the sparse-checkout definition (unless the user
> specifies a path outside of the sparse-checkout). The only difference
> between A and B would be reporting an error if the path is outside the
> definition, right? We don't need to do anything special to simplify
> the history.

You're forgetting the possibility of one or more -C flags.  I'll note
it specifically on the line.

> > +      * show (when given commit arguments)
> > +      * log
> > +     * and variants: shortlog, gitk, show-branch, whatchanged
>
> And here is where we'd need to do that big changes for simplifying
> the history graph. Does 'rev-list' not fit here? I tend to think of
> 'log' as a formatting layer on top of 'rev-list', but maybe that is
> misguided.

Right, rev-list should probably be included here too.

> > +* Comands I don't know how to classify
>
> nit: s/Comands/Commands/

Thanks.

[...]
> > +=== Subcommand-dependent defaults ===
> > +
> > +Note that we have different defaults (for the desired behavior, not just
> > +the current implementation) depending on the command:
> > +
> > +  * Commands defaulting to --restrict:
>
> This appears to be the first mention of --restrict. Perhaps it would be
> worth declaring what --restrict, --restrict-unless-conflicts, and
> --no-restrict mean before creating this categorization?

Probably, yes.  Doing that might have even avoided some of the
confusion below...

[...]
> > +  * Commands defaulting to --restrict-unless-conflicts
> > +    * merge
> > +    * rebase
> > +    * cherry-pick
> > +    * revert
>
> In my mind, --restrict-unless-conflicts doesn't provide any value unless
> you want the --restrict mode to create an _error_ when trying to do
> something outside of the sparse-checkout cone.

Are you assuming here I was suggesting command line flags?  If so, I
apologize for my poor wording/descriptions.  At some point, I was just
noting that I was referring to behavior by the names of `--restrict`
and `--no-restrict`.  While pointing out that a strict interpretation
of the behaviors suggested by each name didn't match all commands, I
came up with names for alternate behaviors.  These names weren't meant
to become flags we'd use on the command line, despite the name that
perhaps suggests such.  Probably a really poor way to name these
behaviors; sorry about that.

Anyway, we do not want the behavior of `--restrict` for these
commands.  That would imply not providing conflicts to users for them
to resolve unless they are contained within the sparse specification,
which would clearly be broken.  We instead chose to write out files
with conflicts regardless of whether they are outside the sparse
specification.  This modified behavior I gave the name of
`--restrict-unless-conflict`, but we don't need or want an actual
command line flag for that.  I think the behavior should just remain
hardcoded into these commands.

(Note: these commands are among those that make me think
--[no-]restrict or --[un]focus or whatever might not make sense as a
git global option: `--restrict-unless-conflict` behavior is the
default for these and in fact that only sensible option, I think.  If
there's only one sensible option, no actual flag names are needed.)

> The only thing I can think about is that the diffstat might want to show
> the stats for the conflicted files, in which case that's an important
> perspective on the distinction from --restrict.

We only show the diffstat on a successful merge, so there's no
diffstat to show if there are any conflicted files.

> > +    In the case of am and apply, those commands only operate on the
> > +    working tree, so they are kind of in the same boat as stash.
> > +    Perhaps `git am` could run `git sparse-checkout reapply`
> > +    automatically afterward and move into a category more similar to
> > +    merge/rebase/cherry-pick, but it'd still be weird because it'd
> > +    vivify files besides just conflicted ones when there are conflicts.
>
> 'git am' should be able to construct the resulting commit from the patch
> without adding files outside of the sparse-checkout definition. If there

That's yet another interesting take on `git am` -- different than what
I originally had in mind, and different from what Junio suggested.  I
think both of your takes are better than what I was initially
thinking, I just wish your two approaches weren't pulling in opposite
directions.  :-)

> is a conflict, it fails in the application, anyway. I suppose you are
> writing this here because 'git am' does not play nice with sparse-checkout
> right now.

Well, as a result of this thread, we now have at least 2-3 potential
solutions we could pursue...

[...]
> > +    * add
> > +    * rm
> > +    * mv
> > +
> > +    The defaults here perhaps make sense since they are nearly --restrict, but
> > +    actually using --restrict could cause user confusion if users specify a
> > +    specific filename, so they warn by default.  That logic may sound like
> > +    --no-restrict should be the default, but that's prone to even bigger confusion:
> > +      * `git add <somefile>` if honored and outside the sparse cone, can result in
> > +     the file randomly disappearing later when some subsequent command is run
> > +     (since various commands automatically clean up unmodified files outside
> > +     the sparsity specification).
> > +      * `git rm '*.jpg'` could very negatively surprise users if it deletes files
> > +     outside the range of the user's interest.  Much better to operate on the
> > +     sparsity specification and give the user warnings if other files could have
> > +     matched.
>
> The cost of checking for other files that might match is sometimes too large
> (needing to expand the sparse index or walk trees to find those path names) that
> I would not recommend warning that we _didn't_ do something. Perhaps an advice
> that says "we did not look outside the sparse-checkout definition for matching
> paths" when the pathspec is not an exact path or a prefix match.

Ah, good point, and a good idea to keep in mind.

However, I think advise_on_updating_sparse_paths() currently does what
you're warning against.  Do you think there's a good chance this is
the cause of the performance bug reported over at
https://lore.kernel.org/git/CABPp-BEkJQoKZsQGCYioyga_uoDQ6iBeW+FKr8JhyuuTMK1RDw@xxxxxxxxxxxxxx
?

> > +  * Commands whose default for --restrict vs. --no-restrict should vary depending
> > +    on Behavior A or Behavior B
> > +    * diff (with --cached or REVISION arguments)
> > +    * grep (with --cached or REVISION arguments)
> > +    * show (when given commit arguments)
> > +    * bisect
> > +    * blame
> > +      * and annotate
> > +    * log
> > +      * and variants: shortlog, gitk, show-branch, whatchanged
> > +
> > +    For now, we default to behavior B for these, which want a default of
> > +    --no-restrict.
>
> I do feel pretty strongly that we'll want a --no-restrict default here
> because otherwise we will present confusion. I'm not even sure if we would
> want to make this available via a config setting, but likely a config
> setting makes sense in the long term.

You've got me slightly confused.  You did say the same thing a long time ago:

    "But I also want to avoid doing this as a default or even behind a
config setting."[A]

BUT, when Shaoxuan proposed making --restrict/--focus the default for
one of these commands, you seemed to be on board[B].

Personally, I thought that if anyone would object to some of these
commands changing, that grep would be considered as among the riskier.
For diff and log, printing a "Warning: restricting output to the
sparse-checkout specification" would be pretty innocuous, but for grep
that wouldn't be.

I was a little unsure about making `--restrict/--focus` the default
for these commands, both based on your previous concerns and because
of thinking about some of my behavior B users.  But then, it seemed
like everyone else was pushing for not only having this behavior but
making it the default[C,D,E,F].  I was beginning to wonder if even you
had decided behavior B didn't matter anymore between your support of
Shaoxuan's change at [B] and your diffstat comments at [G].  But now
it sounds like you're not only against behavior A by default but even
implementing it at all...even though I don't see how that squares with
your previous comments on grep and diffstat.

Is it just a matter of presentation?  Is it specific subcommands you
don't want changed?  Or am I either missing or misunderstanding
something?

Anyway...I will note that without a configurable option to give these
commands a behavior of `--restrict`, I think you make working in
disconnected partial clones practically impossible.  I want to be able
to do "git log -p", "git diff REV1 REV2", and "git grep TERM REV" in
disconnected partial clones, and I've wanted that kind of capability
for well over a decade[H].  So, don't be surprised if I keep bringing
up a config option of some sort for these commands.  :-)

[A] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@xxxxxxxxx/
[B] https://lore.kernel.org/git/e719d1e1-1849-07bc-ea08-2729985e5048@xxxxxxxxxx/,
and the others in the thread
[C] https://lore.kernel.org/git/2fc889c9c264fc10d878f31bd89cc44e79982516.1599758167.git.matheus.bernardino@xxxxxx/
[D] paragraphs with "transitioning" in them from
https://lore.kernel.org/git/a89413b5-464b-2d54-5b8c-4502392afde8@xxxxxxxxxx/
[E] https://lore.kernel.org/git/xmqqh719pcoo.fsf@gitster.g/
[F] https://lore.kernel.org/git/xmqqzgeqw0sy.fsf@gitster.g/
[G] https://lore.kernel.org/git/a86af661-cf58-a4e5-0214-a67d3a794d7e@xxxxxxxxxx/
[H] https://lore.kernel.org/git/1283645647-1891-1-git-send-email-newren@xxxxxxxxx/

> > +=== Implementation Questions ===
> > +
> > +  * Does the name --[no-]restrict sound good to others?  Are there better options?
> > +    * Names in use, or appearing in patches, or previously suggested:
> > +      * --sparse/--dense
> > +      * --ignore-skip-worktree-bits
> > +      * --ignore-skip-worktree-entries
> > +      * --ignore-sparsity
> > +      * --[no-]restrict-to-sparse-paths
> > +      * --full-tree/--sparse-tree
> > +      * --[no-]restrict
>
> I like the simplicity of --[no-]restrict, and my only worry is that it
> doesn't immediately link to what it is restricting.

Yeah, Junio and Victoria brought up other flavors of this same
concern, and it's also the one thing I find suboptimal about this
name.

The problem is just that we need to add the flag in more places,
"sparse" is already taken in some of them with a different meaning,
and I'm not sure there is any other flag that does automatically link
to sparse-checkouts and/or self-describe without being excessively
wordy.

> Perhaps something like "scope" would describe the set of things we care
> about, but use a text mode:
>
>         --scope=sparse  (--restrict)
>         --scope=all     (--no-restrict)
>
> But I'm notoriously bad at naming things.

Yeah, me too.  Naming things is one of the two hard problems in
computer science, right?  (The others being cache invalidation, and
off-by-one errors.)

However, in this case, your suggestion sounds pretty decent to me.
I'll add it to the list for us to consider.

> > +  * Should --[no-]restrict be a git global option, or added as options to each
> > +    relevant command?  (Does that make sense given the multitude of different
> > +    default behaviors we have for different options?)
>
> If we can make it a global option, that would be great, then update
> the commands to behave under that mode as we go.
>
> If that doesn't work, then adding the consistent option across commands
> would be helpful. It might be good to make a OPT_RESTRICT macro (much
> like OPT__VERBOSE, OPT__QUIET, and similar macros.

Ooh, I didn't know about OPT__VERBOSE and OPT__QUIET.  Thanks for the flag.

[...]
> > +  * clone: should we provide some mechanism for tying partial clones and
> > +    sparse checkouts together better.  Maybe an option
> > +     --sparse=dir1,dir2,...,dirN
> > +    which:
> > +       * Does initial fetch with `--filter=blob:none`
> > +       * Does the `sparse-checkout set --cone dir1 dir2 ... dirN` thing
> > +       * Runs a `git rev-list --objects --all -- dir1 dir2 ... dirN` to
> > +      fault in the missing blobs within the sparse
> > +      specification...except that rev-list needs some kind of options
> > +      to also get files from leading directories too.
> > +       * Sets --restrict mode to allow focusing on the cone of interest
> > +      (and to permit disconnected development)
>
> As mentioned, I think we should have the option to backfill the blobs in
> the sparse-checkout definition, but 'git clone' should not do this by
> default. It's something that can be launched in the background, maybe, but
> not a blocking operation on being able to use the repository.
>
> 'scalar clone' is an excellent testing bed for these kinds of things,
> like setting the --restrict mode by default.

Earlier in this same email you were against even making an option to
request --restrict mode, but now you're suggesting to not only
implement it but make it the default in scalar?

> Hopefully my responses aren't too far off-base. I'll go read the rest of
> the discussion now that I've contributed my thoughts on the doc.

Thanks for the detailed response!

I figured we'd have one or two places where all of us had some
disagreements on the big picture, but more and more I'm finding we
aren't even always thinking about the problems the same (e.g. the 3+
different solutions to the `am` issues).  All the more reason that a
document like this is important for us to discuss these details and
work out a plan.