Re: [PATCH 0/2] Sparse checkout status

Elijah Newren <newren@xxxxxxxxx> · Wed, 17 Jun 2020 09:48:22 -0700

Hi Son,

Thanks for the feedback.

On Wed, Jun 17, 2020 at 12:40 AM Son Luong Ngoc <sluongng@xxxxxxxxx> wrote:
>
> Hi Elijah,
>
> > Some of the feedback of folks trying out sparse-checkouts at $dayjob is that
> > sparse checkouts can sometimes be disorienting; users can forget that they
> > had a sparse-checkout and then wonder where files went.
>
> I agree with this observation: that the current 'git sparse-checkout' experience
> could be a bit 'lost' for end users, who may or may not be familiar
> with git's 'arcane magic'.
>
> Currently the only way to verify what's going on is to either run
> 'tree <repo-root-dir>'
> or 'cat .git/info/sparse-checkout' (human-readable but not easy).

Well, there is `git sparse-checkout list`, assuming users know they
are in a sparse-checkout, but the whole point of my suggested change
is that they sometimes don't.

> > This series adds some output to 'git status' and modifies git-prompt slightly as an attempt
> > to help.
>
> This is a great idea but I suggest to put a config/flag to let users
> enable/disable this.
>
> Git status is often utilized in automated commands (IDE, shell prompt,
> etc...) and there may be
> performance implications down the line not being able to skip this bit
> of information out.

This surprises me; I considered performance while writing it and kept
it simple on that basis.  In particular:
  * This does not cause any reading or writing of any extra files; it
is done solely with information that is already loaded.
  * If users aren't in a sparse-checkout, their performance overhead
is a single if-check, which I doubt anyone can measure.
  * If they are in a sparse-checkout, then they'd get one extra loop
over files in the index to check the SKIP_WORKTREE bit.

In which cases would performance implications be a concern?  For a
very simple point of reference, in a sparse-checkout of the linux
kernel (using --cone mode and only selecting the drivers/ directory),
I see the following timings for 'git status' in a clean checkout:

Without my change:
[newren@tiger linux-stable (hwmon-updates|SPARSE)]$ hyperfine --warmup
1 'git status'
Benchmark #1: git status
  Time (mean ± σ):      78.8 ms ±   2.8 ms    [User: 48.9 ms, System: 76.9 ms]
  Range (min … max):    74.0 ms …  84.0 ms    38 runs

With my change:
[newren@eenie linux-stable (hwmon-updates|SPARSE)]$ hyperfine --warmup
1 'git status'
Benchmark #1: git status
  Time (mean ± σ):      79.8 ms ±   2.7 ms    [User: 49.3 ms, System: 77.7 ms]
  Range (min … max):    74.8 ms …  84.5 ms    37 runs

I know the linux kernel is tiny compared to repos like Windows or
Office, but the relative scaling considerations are identical: it's
one extra loop through the cached entries checking a bit for each
entry.  If people are worried about the "extra loop", I could find an
existing loop to modify and just add an extra if-block in it so that
we have the same number of loops.  (I'm doubtful that'd actually help,
but if the concern is an extra loop, it'd certainly avoid that.)
Anyway, if you've got more information about it being too costly, I'm
happy to listen.  Otherwise, the overhead seems pretty small to me and
it's only paid by those who would benefit from the information.

However, all that said, I have good news: Peff already implemented the
flag users can use to avoid this extra output, and did so back in
September of 2009.  It's called "--porcelain".  Automated commands
should already be using it, and if they aren't, they are what needs
fixing -- not the long form status output.

> > For reference, I suspect that in repositories that are large enough that
> > people always use sparse-checkouts (e.g. windows or office repos), that this
> > isn't a problem. But when the repository is approximately
> > linux-kernel-sized, then it is reasonable for some folks to have a full
> > checkout. sparse-checkouts, however, can provide various build system and
> > IDE performance improvements, so we have a split of users who have
> > sparse-checkouts and those who have full checkouts. It's easy for users who
> > are bridging in between the two worlds or just trying out sparse-checkouts
> > for the first time to get confused.
>
> One of our users noted that the experience is improved when combining
> 'git worktree' with sparse-checkout.
> That way you get the correct sparsity for the topic that you are working on.
>
> In a way, the current sparse-checkout experience is similar to a user
> running 'git checkout <rev>' directly
> instead of checking out a branch.
> It does not feel tangible and reproducible.
>
> I was hoping that these concerns will be addressed once the In-Tree
> Sparse-Checkout Definition RFC[1] patch landed.
> We should then be able to print out which Definition File(s) (we often
> call it manifests) were used,
> and ideally, only the top most file(s) in the inheritance tree.
>
> So the ideal experience, in my mind, is something of this sort:
>
>     git sc init --cone
>
>     # assuming a inherited from b and c
>     git sc add --in-tree manifest-dir/module-a.manifest
>     git sc add --in-tree manifest-dir/module-d.manifest
>
>     git sc status
>         Your sparse checkout includes following definition(s):
>         (1) manifest-dir/module-a.manifest
>         (2) manifest-dir/module-d.manifest
>
>     git sc status --all
>         Your sparse checkout includes following definition(s):
>         (1) manifest-dir/module-a.manifest
>         (2) manifest-dir/module-d.manifest
>         (3) manifest-dir/module-b.manifest (included by 1)
>         (4) manifest-dir/module-c.manifest (included by 1)

I think having a 'git sparse-checkout status' would be a fine
subcommand, and output like the above -- possibly also including other
bits Stolee or I mentioned elsewhere in this thread -- would be cool
and would be helpful; it'd complement what I'm doing here quite
nicely.

But you're solving a related problem rather than the one I was
focusing on, and you have left the issue I was focusing on
unaddressed.  In particular, if users forgot that they sparsified in
the first place, how are they going to know to run `git
sparse-checkout status [--all]`?

I think having a simple line of output in `git status` would remind
them.  With that reminder, they could today then go run 'git
sparse-checkout list' or 'gvfs health' (as Stolee mentioned he uses
internally) or './sparsify --info' (as I use internally) to get more
info.  In the future we could provide additional things for them as
well, such as your 'git sparse-checkout status'.

An aside, though, since you linked to the in-tree sparse-checkout
definitions: When I reviewed that series, the possibility of merge
conflicts and not knowing what sparse-checkout should have checked out
when the in-tree defintions themselves were in a conflicted state
seemed to me to be a pretty tough sticking point.  I'm hoping someone
has a clever solution, but I still don't yet.  Do you?

Thanks,
Elijah