On 9/17/2022 9:24 PM, Elijah Newren wrote: > On Fri, Sep 16, 2022 at 8:34 PM Shaoxuan Yuan <shaoxuan.yuan02@xxxxxxxxx> wrote: >> >>> I'm also curious whether there shouldn't be a config option for >>> something like this, so folks don't have to specify it with every >>> invocation. In particular, while I certainly have users that want to >>> just query git for information about the part of the history they are >>> interested in, there are other users who are fully aware they are >>> working in a bigger repository and want to search for additional >>> things to add to their sparse-checkout and predominantly use grep for >>> things like that. They have even documented that `git grep --cached >>> <TERM>` can be used in sparse-checkouts for this purpose...and have >>> been using that for a few years. (I did warn them at the time that >>> there was a risk they'd have to change their command, but it's still >>> going to be a behavioral change they might not expect.) Further, when >>> I brought up changing the behavior of commands during sparse-checkouts >>> to limit to files matching the sparsity paths in that old thread at >>> [1], Stolee was a bit skeptical of making that the default. That >>> suggests, at least, that two independent groups of users would want to >>> use the non-sparse searching frequently, and frequently enough that >>> they'd appreciate a config option. >> >> A config option sounds good. Though I think >> >> 1. If this option is for global behavior: users may better off turning >> off sparse-checkout if they want a config to do things densely everywhere. > > Sorry, it sounds like I haven't explained the usecases to you very > well. Let me try again. > > There are people who want to do everything densely, as you say, and > those folks can just turn off sparse-checkout or not use it in the > first place. Git has traditionally catered to these folks just fine. > However, it's not a subset of interest for this discussion and wasn't > what I was talking about. OK, reading... > There are (at least) two different usecases for people wanting to use > sparse-checkouts; I have users that fall under each category: > > > 1) Working on a repository subset; users are _only_ interested in that subset. > > This usecase is very poorly supported in Git right now, but I think > you understand it so I'll only briefly describe it. > > These folks might know there are other things in the repository, but > don't care. Not only should the working tree be sparse, but grep, > log, diff, etc. should be restricted to the subset of the tree they > are interested in. Right, this is the usecase I am familiar with. > Restricting operations to the sparsity specification is also important > for marrying partial clones with sparse checkouts while allowing > disconnected development. Without such a restrict-to-sparsity-paths > feature, the partial clones will attempt to download objects the first > time they try to grep an old revision, or do log with a glob path. > The download will fail, causing the operation to fail, and break the > ability of the user to work in a disconnected manner. OK, I'm still learning about partial clone, didn't get a chance to look at it. Will try to figure out what this means :) > 2) The working directory is sparse, but users are working in a larger whole. > > Stolee described this usecase this way[2]: > > "I'm also focused on users that know that they are a part of a larger > whole. They know they are operating on a large repository but focus on > what they need to contribute their part. I expect multiple "roles" to > use very different, almost disjoint parts of the codebase. Some other > "architect" users operate across the entire tree or hop between different > sections of the codebase as necessary. In this situation, I'm wary of > scoping too many features to the sparse-checkout definition, especially > "git log," as it can be too confusing to have their view of the codebase > depend on your "point of view." > > [2] https://lore.kernel.org/git/1a1e33f6-3514-9afc-0a28-5a6b85bd8014@xxxxxxxxx/ > > I describe it very similarly, but I'd like to point out something > additional around this usecase and how it can be influenced by > dependencies. The first cut for sparse-checkouts is usually the > directories you are interested in plus what those directories depend > upon within your repository. But there's a monkey wrench here: if you > have integration tests, they invert the hierarchy: to run integration > tests, you need not only what you are interested in and its > dependencies, you also need everything that depends upon what you are > interested in or that depends upon one of your dependencies...AND you > need all the dependencies of that expanded group. That can easily > change your sparse-checkout into a nearly dense one. Naturally, that > tends to kill the benefits of sparse-checkouts. There are a couple > solutions to this conundrum: either avoid grabbing dependencies (maybe > have built versions of your dependencies pulled from a CI cache > somewhere), or say that users shouldn't run integration tests directly > and instead do it on the CI server when they submit a code review. Or > do both. Regardless of whether you stub out your dependencies or stub > out the things that depend upon you, there is certainly a reason to > want to query and be aware of those other parts of the repository. > Thus, sparse-checkouts can be used to limit what you directly build > and modify, but these users do not want it to limit their queries of > history. > > > Once users pick either the first or the second usecase, they often > stick within it. For either group, regardless of what Git's default > is, needing to specify an additional flag for *every* > grep/log/diff/etc. they run would just be a total annoyance. Neither > wants a dense worktree, but one side wants a dense history query while > the other wants a sparse one. Different groups should be able to > configure the default that works well for them, much like we allow > users to configure whether they want "git pull" to rebase or merge. OK, now I get it: Case A: users only interested in a subset, so they need only sparse history and a sparse worktree. v.s. Case B: users works within a subset but needs a larger context, so they need a dense history/query (that's why we should let grep default to --no-restrict, as you suggested?), though still a sparse worktree. > >> 2. If this option is for a single subcommand (e.g. 'grep'): I don't have >> much thoughts here. It certainly can be nice for users who need to do >> non-sparse searching frequently. This design, if necessary, should >> belong to a patch where this config is added for every single subcommand? >> >>> I also brought up in that old thread that perhaps we want to avoid >>> adding a flag to every subcommand, and instead just having a >>> git-global flag for triggering this type of behavior. (e.g. `git >>> --no-restrict grep --cached ...` or `git --dense grep --cached ...`). >> >> This looks more like the answer to me. It's a peace of mind for users if >> they don't have to worry about whether a subcommand is sparse-aware, and >> how may their behaviors differ. Though we still may need to update the >> actual behavior in each subcommand over an extended period of time >> (though may not be difficult?), which you mentioned above "seems like a >> poor strategy". >> >>> [1] https://lore.kernel.org/git/CABPp-BGJ_Nvi5TmgriD9Bh6eNXE2EDq2f8e8QKXAeYG3BxZafA@xxxxxxxxxxxxxx/ >>> and the responses to that email> >>>> Change the default behavior of 'git grep' to focus on the files within >>>> the sparse-checkout definition. To enable the previous behavior, add a >>>> '--sparse' option to 'git grep' that triggers the old behavior that >>>> inspects paths outside of the sparse-checkout definition when paired >>>> with the '--cached' option. >>> >>> I still think the flag name of `--sparse` is totally backwards and >>> highly confusing for the described behavior. I missed Stolee's email >>> at the time (wasn't cc'ed) where he brought up that "--sparse" had >>> already been added to "git-add" and "git-rm", but in those cases the >>> commands aren't querying and I just don't see how they lead to the >>> same level of user confusion. This one seems glaringly wrong to me >>> and both Junio and I flagged it on v1 when we first saw it. (Perhaps >>> it also helps that for the add/rm cases, that a user is often given an >>> error message with the suggested flag to use, which just doesn't make >>> sense here either.) If there is concern that this flag should be the >>> same as add and rm, then I think we need to do the backward >>> compatibility dance and fix add and rm by adding an alias over there >>> so that grep's flag won't be so confusing. >> >> I guess I'm using "--sparse" here because "add", "rm" and "mv" all imply >> that "when operating on a sparse path, ignores/warns unless '--sparse' >> is used". I take it as an analogy so "when searching a sparse path, >> ignores/warns unless '--sparse' is used". As the idea that "Git does >> *not* care sparse contents unless '--[no-]sparse' is specified" is sort >> of established through the implementations in "add", "rm", or "mv", I >> don't see a big problem using "--sparse" here. > > Well, I do. > > In addition to just being utterly backwards and confusing in the > context of grep: > * Both `clone` and `ls-files` use `--sparse` to mean to limit things > to the sparsity cone, so we're already kinda split-brained. Agree. > * grep is more like ls-files (both being querying functions) than > add/rm/mv, so should really follow its lead instead of the one from > add/rm/mv. Agree. > * There's another way to interpret `--sparse` for `add` and `rm` > such that it makes sense (at least to me); see my other email to Junio > in this thread. According to the spirit of your points, I think they should be defaulting to --restrict (a rename perhaps) right now. > * `mv` is indeed using it backward, but the `mv` change is new to > this cycle (and undocumented) so I'm not sure it counts as much of a > precedent yet. Oops, I was making the modifications to `mv` and forgot to add documentation to it. Though the --sparse of `mv` was not documented before I touching it. Perhaps it can be added later if we are going to rename --sparse/--dense to --restrict/--no-restrict. >> I *think*, as long as the users are informed that the default is to >> ignore things outside of the sparse-checkout definition, and they have >> to do something (using "--sparse" or a potential better name) to >> override the default, we are safe to use a name that is famous (i.e. >> "--sparse") even though its literal meaning is not perfectly descriptive. >> >> One outlier I do find confusing though, is the "--sparse" option from >> "git-ls-files". Without it, Git expands the index and show everything >> outside of sparse-checkout definition, which seems a bit controversial. > > Nah, that perfectly matches the expectation of users in the second > usecase above -- querying (ls-files/grep/log/diff) defaults to > non-restricted history, modifying (add/rm/mv) defaults to restricted > paths but warns if the arguments could have matched something else, > and the working tree is restricted to sparse paths. It doesn't seem > too controversial to me, even if it's not what we want for the > long-term default. OK. After the reasoning you gave above, now the --sparse of ls-files looks good. > > The defaults for the first usecase would be defaulting to restricted > paths for everything, and perhaps not warn if arguments to a modifying > command could have matched something else. > > > Anyway, hope that helps you understand my perspective and framing. Thanks for the explanations, now I get it and agree with your points :) Thanks, Shaoxuan