On Thu, Oct 6, 2022 at 11:27 AM Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote: > > On 10/6/22 3:10 AM, Elijah Newren wrote: > > On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote: > >> > >> On 9/28/22 1:38 AM, Elijah Newren wrote: [...] > >> I'm specifically talking about 'git log'. I think that having that be > >> in a restricted mode is extremely dangerous and will only confuse users. > >> This includes 'git show' (with commit arguments) and 'git bisect', I > >> think. > > > > Thanks, that helps me understand your position better. > > > > I'm curious if, due to the length of the document and this thread, > > you're just skimming past the idea I mentioned of showing a warning at > > the beginning of `diff`, `log`, or `show` output when restricting > > based on config or defaults. Without such a warning, I agree that > > restricting might be confusing at times, but I think such a warning > > may be sufficient to address the concerns around partial/incomplete > > results. The one command that this warning idea doesn't help with is > > `grep` since it cannot safely be applied there, which potentially > > leaves `grep` giving confusing results when users pass either > > `--cached` or revisions, but you seem to not be concerned about that. > > I'm not convinced that warnings are enough for some cases I'm not sure I'm following. You suggested earlier in this thread that we may want to provide a mode where commands "don't just fail if you can't get new objects, instead inform me that the results are incomplete". You re-emphasized that in your most recent email by saying "To be _absolutely sure_ that on-demand downloads don't happen, we need an extra mode for Git and new ways of reporting partial results." So it sounds like you're suggesting a mode where partial results are a forced option, because how else can you be "_absolutely sure_ that on-demand downloads don't happen"? And if we always want to allow partial results, don't you need to inform users about those results being potentially incomplete? How exactly does one inform the user that results are incomplete if not by a warning? Something seems inconsistent here, but perhaps I'm just misunderstanding something? I think, based on what you said below, that you're uncomfortable with certain types of incompleteness, such as partial revision results, but are fine with others such as those dealing with partial blob results (whether in breadth or in depth). But if so, I'm still not sure what your statement about warnings means. If we scope operations down to the sparsity paths (e.g. potentially giving a partial-breadth diff for "git diff REV1 REV2"), what's your expectation with regards to warnings? >, especially > for output that is fed to a pager. Do the warnings stick around in > the pager? I'm not sure. If the warning is printed on stdout, then yes the warning will stick around in a pager. If the warning is printed on stderr, then the warning is likely of dubious utility since it can easily get lost. Since log & diff output are not adversely affected by additional preliminary output, I think stdout is where such a warning should go (unless folks feel like we don't even need a warning?). However, grep would be strongly negatively affected by additional output, and that's why I've stated several times that warnings cannot reasonably be included with grep. But, so far, no one has expressed concern with providing partial results for grep even if no warning can be given, so perhaps it doesn't matter. > > I'm also curious if the problem partially stems from the fact that > > with `git log` there is no way to control revision limiting and diff > > generation paths independently. If there was a way to make `git log > > -p` continue showing the regular list of commits but restrict which > > paths were shown in the diffs, and we made the --scope-sparse handling > > do this so that only diffs were limited but not the revisions > > traversed/printed, would that help address your concerns? > > My biggest issue is with the idea of simplifying the commit history > based on the sparse-checkout path definitions. The '-p' option having > a diff scoped to the sparse-checkout paths would be fine. Wahoo! Sounds like we have a path forward then. I'll update the document in my patch to reflect this distinction. Note that it's not just the -p option to log, though, but anything related to patches: diff formatting, diff filtering, rename & copy detection, and pickaxe-related options. The one place where the scoping to sparse-checkouts is slightly funny for `git log` is with --remerge-diff (because the merge machinery ignores sparsity patterns when generating the new toplevel tree; however after the new toplevel tree is generated, we would generate a diff that is limited to the sparsity patterns). [...] > > Also, blame incorporates a component of changes from the worktree, but > > it's mostly about history (and one or more -C's make it check other > > paths as well). > > Since each input is a specific file path, I'm not sure we need > anything here except perhaps a warning that they are requesting > a file outside the sparse-checkout definition (if even that). Your statement seems to suggest you are assuming that git blame will only operate on the path listed on the command line. Am I reading your assumption correctly, or am I totally misunderstanding why you would claim nothing is needed beyond a warning about the path the user typed? If I'm understanding your assumption correctly, your assumption does not hold when one or more -C options are passed. Since my earlier mentions of those options and their ramification didn't connect, perhaps it would help if I was a bit more explicit about what I mean. Let's take a simple example, in git.git, which you can run right now: git blame -C -C cache.h This command will show lines of text that now appear in cache.h but which came *from* all of these files: * builtin/clean.c * cache.h * merge-recursive.h * notes.c * object-file.c * object.h * read-cache.c * setup.c * sha1-file.c * sha1_file.c * sha1_name.c * show-diff.c * symlinks.c * tree-walk.h In order to find out and report that the current lines of cache.h came from these other files, blame has to search a wide range of other files in the repository. That potential wide range of other files in the repository is something we could consider tailoring when in a sparse-checkout, at least for Behavior A folks. [...] > I think the following things are true: > > 1. It's really important to keep the current partial clone default of > only downloading blobs on-demand. Even with a limited sparse-checkout, > it's rare that users will need every version of every file in that > sparse-checkout, and they may not want that tax on their local storage. I do agree we need to keep these in mind for some usecases, but I do not agree these are universally true among sparse-checkout users. However, our differences on this probably don't matter in practice since you then immediately suggested... > 2. Adding an opt-in backfill for a sparse-checkout definition will > prevent most on-demand downloads (although it might want to be > integrated into 'git fetch' behind an option to be really sure that > state continues in the future). Yes, this would be great. One question, though: integrated with `fetch` or with `sparse-checkout set|add`? If users adjust their sparse-checkout definition, that might be a good time to allow them to automatically trigger fixing the missing backfill at the same time. > 3. Updating Git features to scope down to sparse-checkout will prevent > many of the remaining on-demand downloads. Yes, though I'd clarify "scope down to sparse-checkout where it can make sense". Things like merge & bundle have to pay attention to changes outside the sparse-checkout, but we can get commands like diff/log -p/grep to scope down in breadth. > 4. To be _absolutely sure_ that on-demand downloads don't happen, we > need an extra mode for Git and new ways of reporting partial results. > Without this mode, Git commands fail when triggering an on-demand > download and the network is unavailable. While many commands might be able to produce partial results realistically, I think things like merge & bundle should not support such a mode and just fail if they are missing any data they normally need. Basically, we'd still have commands that would fail without a network connection beyond push/pull/fetch, but this mode would limit the list as much as possible through allowing commands to limit both breadth and depth of the blobs we act upon. > So, I'm saying that (4) is a direction that we could go. It also seems > extremely difficult to do, so we should do (2) & (3) first, which will > get us 99% of the way there. Agreed on all three counts.