On 12/31/2020 3:03 PM, Elijah Newren wrote:
> Sorry for the long delay...

Thanks for bringing us all back to the topic.

> sparse-checkout's purpose is not fully defined. Does it exist to:
>   A) allow working on a subset of the repository?
>   B) allow working with a subset of the repository checked out?
>   C) something else?

I think it's all of the above!

My main focus for sparse-checkout is a way for users who care about a
small fraction of a repository's files to only do work on those files.
This saves time because they are asking Git to do less, but also they
can use tools like IDEs with "Open Folder" options without falling
over. Writing fewer files also affects things like their OS indexing
files for search or antivirus scanning written files.

Others use sparse-checkout to remove a few large files unless they
need them. I'm less interested in this case, myself.

Both perspectives get better with partial clone because the download
size shrinks significantly. While partial clone has a sparse-checkout
style filter, it is hard to compute on the server side. Further, it is
not very forgiving of someone wanting to change their sparse definition
after cloning. Tree misses are really expensive, and I find that the
extra network transfer of the full tree set is a price that is worth
paying.

I'm also focused on users that know that they are a part of a larger
whole. They know they are operating on a large repository but focus on
what they need to contribute their part. I expect multiple "roles" to
use very different, almost disjoint parts of the codebase. Some other
"architect" users operate across the entire tree or hop between
different sections of the codebase as necessary. In this situation, I'm
wary of scoping too many features to the sparse-checkout definition,
especially "git log," as it can be too confusing to have their view of
the codebase depend on their "point of view."

(In case we _do_ start changing behavior in this way, I'm going to use
the term "sparse parallax" to describe users being confused about their
repositories because they have different sparse-checkout definitions,
changing what they see from "git log" or "git diff".)

> === Why it matters ==
>
> There are unfortunately *many* gray areas when you try to define how git
> subcommands should behave in sparse-checkouts. (The
> implementation-level definition from a decade ago of "files are assumed
> to be unchanged from HEAD when SKIP_WORKTREE is set, and we remove files
> with that bit set from the working directory" definition from the past
> provides no clear vision about how to resolve gray areas, and also leads
> to various inconsistencies and surprises for users.) I believe a
> definition based around a usecase (or usecases) for the purpose of
> sparse-checkouts would remove most of the gray areas.
>
> Are there choices other than A & B that I proposed above that make
> sense? Traditionally, I thought of B as just a partial implementation
> of A, and that A was the desired end-goal. However, others have argued
> for B as a preferred choice (some users at $DAYJOB even want both A and
> B, meaning they'd like a simple short flag to switch between the two).
> There may be others I'm unaware of.
>
> git implements neither A nor B. It might be nice to think of git's
> current behavior as a partial implementation of B (enough to provide
> some value, but still feel buggy/incomplete), and that after finishing B
> we could add more work to allow A. I'm not sure if the current
> implementation is just a subset of B, though.
>
> Let's dig in...

I read your detailed message and I think you make some great points.
I think there are three possible situations:

1. sparse-checkout should not affect the behavior at all.

   An example for this is "git commit". We want the root tree to
   contain all of the subtrees and blobs that are out of the
   sparse-checkout definition.
   The underlying object model should never change.

2. sparse-checkout should change the default, but users can opt out.

   The examples I think of here are 'git grep' and 'git rm', as we
   have discussed recently. Having a default of "you already chose to
   be in a sparse-checkout, so we think this behavior is better for
   you" should continue to be pursued.

3. Users can opt in to a sparse-checkout version of a behavior.

   The example in this case is "git diff". Perhaps we would want to
   see a diff scoped only to our sparse definition, but that should
   not be the default. It is too risky to change the output here
   without an explicit choice by the user.

Let's get into your concrete details now:

> === behavioral proposals ===
>
> Short term version:
>
> * en/stash-apply-sparse-checkout: apply as-is.
>
> * mt/rm-sparse-checkout: modify it to ignore sparse.restrictCmds --
>   `git rm` should be like `git add` and _always_ ignore
>   SKIP_WORKTREE paths, but it should print a warning (and return
>   with non-zero exit code) if only SKIP_WORKTREE'd paths match the
>   pathspec. If folks want to remove (or add) files outside current
>   sparsity paths, they can either update their sparsity paths or use
>   `git update-index`.
>
> * mt/grep-sparse-checkout: figure out shorter flag names. Default to
>   --no-restrict-to-sparse, for now. Then merge it for git-2.31.

I don't want to derail your high-level conversation too much, but by
the end of January I hope to send an RFC to create a "sparse index"
which allows the index to store entries corresponding to a directory
with the skip-worktree bit on. The biggest benefit is that commands
like 'git status' and 'git add' will actually change their performance
based on the size of the sparse-checkout definition and not the total
number of paths at HEAD.
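
To make the sparse-index idea above more concrete, here is a toy Python
sketch. It is an illustration only and not Git's index format or code:
the function names (`ancestors`, `sparse_index_entries`), the sample
paths, and the simplified cone-mode rule (keep root files, files
directly inside a parent of a cone directory, and everything under a
cone directory) are all assumptions made for this example. Paths outside
the cone collapse into a single directory entry, so the entry count
follows the sparse-checkout definition rather than the number of paths
at HEAD.

```python
from typing import Iterable, List


def ancestors(directory: str) -> List[str]:
    """Return the directory and all of its parents, shallowest first."""
    parts = directory.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def sparse_index_entries(paths: Iterable[str], cone_dirs: Iterable[str]) -> List[str]:
    """Toy model: list the entries a collapsed ("sparse") index would hold.

    Hypothetical helper, not a Git API. File paths inside the cone are
    kept as-is; paths outside it are replaced by one entry for the
    shallowest directory that lies entirely outside the cone.
    """
    cone_dirs = list(cone_dirs)
    # Directories that must stay expanded: each cone directory and its parents.
    expanded = {a for d in cone_dirs for a in ancestors(d)}
    entries: List[str] = []
    seen = set()
    for path in sorted(paths):
        parent = path.rpartition("/")[0]
        in_cone = (
            parent == ""                     # file at the repository root
            or parent in expanded            # file directly in a cone dir or its parent
            or any(parent.startswith(d + "/") for d in cone_dirs)  # deeper in a cone dir
        )
        if in_cone:
            entry = path
        else:
            # Shallowest ancestor directory that contains nothing from the cone.
            entry = next(p for p in ancestors(parent) if p not in expanded) + "/"
        if entry not in seen:
            seen.add(entry)
            entries.append(entry)
    return entries


paths_at_head = [
    "README",
    "client/app.c",
    "client/net/socket.c",
    "client/ui/button.c",
    "docs/api.txt",
    "docs/guide/intro.txt",
    "server/db/schema.sql",
    "server/main.c",
]
print(sparse_index_entries(paths_at_head, ["client/ui"]))
```

In this toy model the eight paths shrink to six entries, three of which
are directory entries ("client/net/", "docs/", "server/"). Adding more
files under "docs/" or "server/" would not add entries.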

The other thing that happens once we have that idea is that these
behaviors in 'git grep' or 'git rm' actually become _easier_ to
implement because we don't even have an immediate reference to the
blobs outside of the sparse cone (assuming cone mode).

The tricky part (that I'm continuing to work on, hence no RFC today) is
enabling the part where a user can opt in to the old behavior. This
requires parsing trees to expand the index as necessary. A simple
approach is to create an in-memory index that is the full expansion at
HEAD, when necessary. It will be better to do expansions in a targeted
way.

(Your merge-ort algorithm is critical to the success here, since that
doesn't use the index as a data structure. I expect to make merge-ort
the default for users with a sparse index. Your algorithm will be done
first.)

My point in bringing this up is that perhaps we should pause concrete
work on updating other builtins until we have a clearer idea of what a
sparse index could look like and how the implementation would change
based on having one or not. I hope that my RFC will be illuminating in
this regard.

Ok, enough of that sidebar. I thought it important to bring up, but

> Longer term version:
>
> I'll split these into categories...
>
> --> Default behavior
>   * Default to behavior B (--no-restrict-to-sparse from
>     mt/grep-sparse-checkout) for now. I think that's the wrong default
>     for when we marry sparse-checkouts with partial clones, but we only
>     have patches for behavior A for git grep; it may take a while to
>     support behavior A in each command. Slowly changing behavior of
>     commands with each release is problematic. We can discuss again
>     after behavior A is fully supported what to make the defaults be.
>
> --> Commands already working with sparse-checkouts; no known bugs:
>   * status
>   * switch, the "switch" parts of checkout
>
>   * read-tree
>   * update-index
>   * ls-files
>
> --> Enhancements
>   * General
>     * shorter flag names than --[no-]restrict-to-sparse.
>       --dense and --sparse? --[no-]restrict? --full-workdir?
>   * sparse-checkout (After behavior A is implemented...)
>     * Provide warning if sparse.restrictCmds not set (similar to git
>       pull's warning with no pull.rebase, or git checkout's warning when
>       detaching HEAD)
>   * clone
>     * Consider having clone set sparse.restrictCmds based on whether
>       --partial is provided in addition to --sparse.

In general, we could use some strategies to help users opt in to these
new behaviors more easily. We are very close to having the only real
feature of Scalar be that it sets these options automatically, and we
will continue to push to the newest improvements where possible.

> --> Commands with minor bugs/annoyances:
>   * add
>     * print a warning if pathspec only matches SKIP_WORKTREE files (much
>       as it already does if the pathspec matches no files)
>
>   * reset --hard
>     * spurious and incorrect warning when removing a newly added file
>   * merge, rebase, cherry-pick, revert
>     * unnecessary unsparsification (merge-ort should fix this)
>   * stash
>     * similar to merge, but there are extra bugs from the pipeline
>       design. en/stash-apply-sparse-checkout fixes the known issues.
>
> --> Buggy commands
>   * am
>     * should behave like merge commands -- (1) it needs to be okay for
>       the file to not exist in the working directory; vivify it if
>       necessary, (2) any conflicted paths must remain vivified, (3)
>       paths which merge cleanly can be unvivified.
>   * apply
>     * See am
>   * rm
>     * should behave like add, skipping SKIP_WORKTREE entries.
>       See comments on mt/rm-sparse-checkout elsewhere
>   * restore
>     * with revisions and/or globs, sparsity patterns should be heeded
>   * checkout
>     * see restore
>
> --> Commands that need no changes because commits are full-tree:
>   * archive
>   * bundle
>   * commit
>   * format-patch
>   * fast-export
>   * fast-import
>   * commit-tree
>
> --> Commands that would change for behavior A
>   * bisect
>     * Only consider commits touching paths matching sparsity patterns
>   * diff
>     * When given revisions, only show subset of files matching sparsity
>       patterns. If pathspecs are given, intersect them with sparsity
>       patterns.
>   * log
>     * Only consider commits touching at least one path matching sparsity
>       patterns. If pathspecs are given, paths must match both the
>       pathspecs and the sparsity patterns in order to be considered
>       relevant and be shown.
>   * gitk
>     * See log
>   * shortlog
>     * See log
>   * grep
>     * See mt/grep-sparse-checkout; it's been discussed in detail..and is
>       implemented. (Other than that we don't want behavior A to be the
>       default when so many commands do not support it yet.)
>
>   * show-branch
>     * See log
>   * whatchanged
>     * See log
>   * show (at least for commits)
>     * See diff
>
>   * blame
>     * With -C or -C -C, only detect lines moved/copied from files that match
>       the sparsity paths.
>   * annotate
>     * See blame.

this "behavior A" idea is the one I'm most skeptical about.

Creating a way to opt in to a sparse definition might be nice. It might
be nice to run "git log --simplify-sparse" to see the simplified
history when only caring about commits that changed according to the
current sparse-checkout definitions. Expand that more when asking for
diffs as part of that log, and the way we specify the option becomes
tricky.

But I also want to avoid doing this as a default or even behind a
config setting. We already get enough complaints about "missing
commits" when someone does a bad merge so "git log -- file" simplifies
away a commit that exists in the full history.
Imagine someone saying "on my machine, 'git log' shows the commit, but
my colleague can't see it!" I would really like to avoid adding to that
confusion if possible.

> --> Commands whose behavior I'm still uncertain of:
>   * worktree add
>     * for behavior A (marrying sparse-checkout with partial clone), we
>       should almost certainly copy sparsity paths from the previous
>       worktree (we either have to do that or have some kind of
>       specify-at-clone-time default set of sparsity paths)
>     * for behavior B, we may also want to copy sparsity paths from the
>       previous worktree (much like a new command line shell will copy
>       $PWD from the previous one), but it's less clear. Should it?

I think 'git worktree add' should at minimum continue using a
sparse-checkout if the current working directory has one. Worktrees are
a great way to scale the creation of multiple working directories for
the same repository without re-cloning all of the history. In a partial
clone case, it's really important that we don't explode the workdir in
the new worktree (or even download all those blobs).

Now, should we copy the sparse-checkout definitions, or start with the
"only files at root" default? That's a more subtle question.

>   * range-diff
>     * is this considered to be log-like for format-patch-like in
>       behavior?

If we stick with log acting on the full tree unless specified in the
command-line options, then range-diff can be the same. Seems like a
really low priority, though, because of the proximity to format-patch.

>   * cherry
>     * see range-diff
>   * plumbing -- diff-files, diff-index, diff-tree, ls-tree, rev-list
>     * should these be tweaked or always operate full-tree?
>   * checkout-index
>     * should it be like checkout and pay attention to sparsity paths, or
>       be considered special like update-index, ls-files, & read-tree and
>       write to working tree anyway?
>   * mv
>     * I don't think mv can take a glob, and I think it currently happens to
>       work.
>       Should we add a comment to the code that if anyone wants to
>       support mv using pathspecs they might need to be careful about
>       SKIP_WORKTREE?
>
> --> Might need changes, but who cares?
>   * merge-file
>   * merge-index
>
> --> Commands with no interaction with sparse-checkout:

(I agree with the list you included here.)

Thanks for starting the discussion. Perhaps more will pick it up as
they return from the holiday break.

Thanks,
-Stolee