Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions

Elijah Newren <newren@xxxxxxxxx> · Thu, 6 Oct 2022 19:56:11 -0700

On Thu, Oct 6, 2022 at 11:27 AM Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote:
>
> On 10/6/22 3:10 AM, Elijah Newren wrote:
> > On Wed, Sep 28, 2022 at 6:22 AM Derrick Stolee <derrickstolee@xxxxxxxxxx> wrote:
> >>
> >> On 9/28/22 1:38 AM, Elijah Newren wrote:
[...]
> >> I'm specifically talking about 'git log'. I think that having that be
> >> in a restricted mode is extremely dangerous and will only confuse users.
> >> This includes 'git show' (with commit arguments) and 'git bisect', I
> >> think.
> >
> > Thanks, that helps me understand your position better.
> >
> > I'm curious if, due to the length of the document and this thread,
> > you're just skimming past the idea I mentioned of showing a warning at
> > the beginning of `diff`, `log`, or `show` output when restricting
> > based on config or defaults.  Without such a warning, I agree that
> > restricting might be confusing at times, but I think such a warning
> > may be sufficient to address the concerns around partial/incomplete
> > results.  The one command that this warning idea doesn't help with is
> > `grep` since it cannot safely be applied there, which potentially
> > leaves `grep` giving confusing results when users pass either
> > `--cached` or revisions, but you seem to not be concerned about that.
>
> I'm not convinced that warnings are enough for some cases

I'm not sure I'm following.  You suggested earlier in this thread that
we may want to provide a mode where commands "don't just fail if you
can't get new objects, instead inform me that the results are
incomplete".  You re-emphasized that in your most recent email by
saying "To be _absolutely sure_ that on-demand downloads don't happen,
we need an extra mode for Git and new ways of reporting partial
results."  So it sounds like you're suggesting a mode where partial
results are a forced option, because how else can you be "_absolutely
sure_ that on-demand downloads don't happen"?  And if we always want
to allow partial results, don't you need to inform users about those
results being potentially incomplete?  How exactly does one inform the
user that results are incomplete if not by a warning?  Something seems
inconsistent here, but perhaps I'm just misunderstanding something?

I think, based on what you said below, that you're uncomfortable with
certain types of incompleteness, such as partial revision results, but
are fine with others such as those dealing with partial blob results
(whether in breadth or in depth).  But if so, I'm still not sure what
your statement about warnings means.  If we scope operations down to
the sparsity paths (e.g. potentially giving a partial-breadth diff for
"git diff REV1 REV2"), what's your expectation with regards to
warnings?

>, especially
> for output that is fed to a pager. Do the warnings stick around in
> the pager? I'm not sure.

If the warning is printed on stdout, then yes the warning will stick
around in a pager.  If the warning is printed on stderr, then the
warning is likely of dubious utility since it can easily get lost.
Since log & diff output are not adversely affected by additional
preliminary output, I think stdout is where such a warning should go
(unless folks feel like we don't even need a warning?).  However, grep
would be strongly negatively affected by additional output, and that's
why I've stated several times that warnings cannot reasonably be
included with grep.

But, so far, no one has expressed concern with providing partial
results for grep even if no warning can be given, so perhaps it
doesn't matter.

> > I'm also curious if the problem partially stems from the fact that
> > with `git log` there is no way to control revision limiting and diff
> > generation paths independently.  If there was a way to make `git log
> > -p` continue showing the regular list of commits but restrict which
> > paths were shown in the diffs, and we made the --scope-sparse handling
> > do this so that only diffs were limited but not the revisions
> > traversed/printed, would that help address your concerns?
>
> My biggest issue is with the idea of simplifying the commit history
> based on the sparse-checkout path definitions. The '-p' option having
> a diff scoped to the sparse-checkout paths would be fine.

Wahoo!  Sounds like we have a path forward then.  I'll update the
document in my patch to reflect this distinction.

Note that it's not just the -p option to log, though, but anything
related to patches: diff formatting, diff filtering, rename & copy
detection, and pickaxe-related options.  The one place where the
scoping to sparse-checkouts is slightly funny for `git log` is with
--remerge-diff (because the merge machinery ignores sparsity patterns
when generating the new toplevel tree; however after the new toplevel
tree is generated, we would generate a diff that is limited to the
sparsity patterns).

[...]
> > Also, blame incorporates a component of changes from the worktree, but
> > it's mostly about history (and one or more -C's make it check other
> > paths as well).
>
> Since each input is a specific file path, I'm not sure we need
> anything here except perhaps a warning that they are requesting
> a file outside the sparse-checkout definition (if even that).

Your statement seems to suggest you are assuming that git blame will
only operate on the path listed on the command line.  Am I reading
your assumption correctly, or am I totally misunderstanding why you
would claim nothing is needed beyond a warning about the path the user
typed?  If I'm understanding your assumption correctly, your
assumption does not hold when one or more -C options are passed.
Since my earlier mentions of those options and their ramification
didn't connect, perhaps it would help if I was a bit more explicit
about what I mean.  Let's take a simple example, in git.git, which you
can run right now:

   git blame -C -C cache.h

This command will show lines of text that now appear in cache.h but
which came *from* all of these files:

    * builtin/clean.c
    * cache.h
    * merge-recursive.h
    * notes.c
    * object-file.c
    * object.h
    * read-cache.c
    * setup.c
    * sha1-file.c
    * sha1_file.c
    * sha1_name.c
    * show-diff.c
    * symlinks.c
    * tree-walk.h

In order to find out and report that the current lines of cache.h came
from these other files, blame has to search a wide range of other
files in the repository.  That potential wide range of other files in
the repository is something we could consider tailoring when in a
sparse-checkout, at least for Behavior A folks.

[...]
> I think the following things are true:
>
> 1. It's really important to keep the current partial clone default of
>    only downloading blobs on-demand. Even with a limited sparse-checkout,
>    it's rare that users will need every version of every file in that
>    sparse-checkout, and they may not want that tax on their local storage.

I do agree we need to keep these in mind for some usecases, but I do
not agree these are universally true among sparse-checkout users.
However, our differences on this probably don't matter in practice
since you then immediately suggested...

> 2. Adding an opt-in backfill for a sparse-checkout definition will
>    prevent most on-demand downloads (although it might want to be
>    integrated into 'git fetch' behind an option to be really sure that
>    state continues in the future).

Yes, this would be great.  One question, though: integrated with
`fetch` or with `sparse-checkout set|add`?  If users adjust their
sparse-checkout definition, that might be a good time to allow them to
automatically trigger fixing the missing backfill at the same time.

> 3. Updating Git features to scope down to sparse-checkout will prevent
>    many of the remaining on-demand downloads.

Yes, though I'd clarify "scope down to sparse-checkout where it can
make sense".  Things like merge & bundle have to pay attention to
changes outside the sparse-checkout, but we can get commands like
diff/log -p/grep to scope down in breadth.

> 4. To be _absolutely sure_ that on-demand downloads don't happen, we
>    need an extra mode for Git and new ways of reporting partial results.
>    Without this mode, Git commands fail when triggering an on-demand
>    download and the network is unavailable.

While many commands might be able to produce partial results
realistically, I think things like merge & bundle should not support
such a mode and just fail if they are missing any data they normally
need.  Basically, we'd still have commands that would fail without a
network connection beyond push/pull/fetch, but this mode would limit
the list as much as possible through allowing commands to limit both
breadth and depth of the blobs we act upon.

> So, I'm saying that (4) is a direction that we could go. It also seems
> extremely difficult to do, so we should do (2) & (3) first, which will
> get us 99% of the way there.

Agreed on all three counts.