Re: [PATCH] sparse-checkout.txt: new document with sparse-checkout directions

Elijah Newren <newren@xxxxxxxxx> · Fri, 14 Oct 2022 21:37:50 -0700

On Fri, Oct 14, 2022 at 7:17 PM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
>
> Elijah Newren <newren@xxxxxxxxx> wrote on Thu, Oct 6, 2022 at 15:53:
> >
> > On Fri, Sep 30, 2022 at 2:54 AM ZheNing Hu <adlternative@xxxxxxxxx> wrote:
> > >
> > > Elijah Newren <newren@xxxxxxxxx> wrote on Wed, Sep 28, 2022 at 13:38:
> > > >
[...]
> > As an example, the repository to which we first applied sparse-checkouts
> > (and which had the complicated dependencies) does not use partial
> > clones or a sparse-index.  While partial clone and sparse-index might
> > help a little, the .git directory for a full clone is merely 2G, and
> > there are fewer than 100K entries in the index.  However,
> > sparse-checkout helps out a lot.
>
> Yes, you explain well here that we don't necessarily need to apply
> all of these features. But I am still a little confused: where do the
> time savings come from? Do they come from git checkout taking less
> time? Or from avoiding some unnecessary working tree scans during
> test/build time?

It is neither git checkout time, nor tree scans; it's the ability to
avoid building large parts of the project coupled with the
significantly better responsiveness of IDEs when project scope is
limited.  When directories are entirely missing, we don't need to
build any of the code in those directories and can instead just use
already built artifacts from the most recent point in history that has
been built on our continuous integration infrastructure.  (Note: our
sparsification tool will keep any modules/directories where there have
been modifications since the most recent upstream commit that has been
built, so we don't risk getting a wrong build via this strategy.)
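
To make that rule concrete, here is a rough sketch in Python of how
such a tool could decide which modules to keep.  This is not our actual
tool; the function name, the module layout, and the idea of feeding it
a list of changed paths (e.g. from something like `git diff --name-only
<last-built-commit>`) are all assumptions for illustration.

```python
from pathlib import PurePosixPath


def modules_to_keep(changed_paths, modules, wanted):
    """Return the sorted module directories that must stay checked out.

    changed_paths: paths modified since the most recent CI-built commit
                   (hypothetical input; how it is obtained is up to the tool).
    modules:       every module directory known to the build.
    wanted:        modules the user explicitly asked for.
    """
    keep = set(wanted)
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        for module in modules:
            module_parts = PurePosixPath(module).parts
            # Compare whole path components so that "libs/net" does not
            # accidentally match "libs/network/...".
            if parts[:len(module_parts)] == module_parts:
                keep.add(module)
    return sorted(keep)
```

Any module not returned here could be dropped from the working tree
and satisfied from prebuilt artifacts instead.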

[...]
> > > 1. mount the large git repo on the server to local.
> > > 2. just ssh to a remote server to run integration tests.
> > > 3. use an external tool to run integration tests on the remote server.
> >
> > Are you suggesting #1 as a way for just handling the git history, or
> > also for handling the worktree with some kind of virtual file system
> > where not all files are actually written locally?  If you're only
> > talking about the history, then you're kind of going on a tangent
> > unrelated to this document.  If you're talking about worktrees and
> > virtual file systems, then Git proper doesn't have anything of the
> > sort currently.  There are at least two solutions in this space --
> > Microsoft's Git-VFS (which I think they are phasing out) and Google's
> > similar virtual file system -- but I'm not currently particularly
> > interested in either one.
> >
>
> Here I mean git nfs, or some kind of git virtual file system, or some
> git workspace. I don't really understand why they are being phased
> out now.

You'd have to ask them, or read their comments on it.  I think they
believe sparse-checkout with a normal file system is or will be better
than the behavior they are getting from their virtual file system (and
they've put a lot of really good work behind making sure that is the
case).

[...]
> Some users may really want to focus only on their subprojects, so I think
> "git log -p" shouldn't show files that don't satisfy the
> sparse-checkout patterns,
> and "git grep" too. But some users may need to search something globally,
> and I think those people are in the minority, so maybe there should be a
> "git log -p --scrope=all" or "git grep --scrope=all" for them.

Good to know you're in the "Behavior A" camp and we've got another
vote for implementing things in that direction.  A couple of small
points, though:
  * It's --scope rather than --scrope.  ;-)
  * I have to disagree here slightly about people using a --scope=all
flag -- I don't think users should have to specify it with every grep
or log invocation.  Users in the "Behavior B" camp would want
`--scope=all` behavior for nearly every grep and log -p invocation
they make; it's annoying and unfair to force them to spell it out
every time.  So, I think we need a configuration option.
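
The precedence I have in mind could look roughly like the Python
sketch below: an explicit flag wins, otherwise the configured value
applies, otherwise a built-in default.  Nothing here exists in Git
today; the function name, the parameter names, and the "sparse" value
are purely hypothetical placeholders.

```python
def effective_scope(cli_scope=None, configured_scope=None,
                    builtin_default="sparse"):
    """Pick the scope a grep/log invocation would use.

    cli_scope:        value of a hypothetical --scope flag, or None if absent.
    configured_scope: value of a hypothetical config setting, or None if unset.
    builtin_default:  fallback when neither is given ("sparse" is an
                      assumed name, not an existing Git value).
    """
    # The first explicitly provided value wins: command line, then config.
    for candidate in (cli_scope, configured_scope):
        if candidate is not None:
            return candidate
    return builtin_default
```

With something like this, a "Behavior B" user would set the config to
"all" once and never need to pass the flag.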

[...]
> > Sometimes merge has to download blobs to know if there are conflicts
> > or not.  But only sometimes.  Since tree objects have the hashes of
> > the blobs, having the tree objects is sufficient to determine which
> > side(s) of history modified each path.
> >
> > If both sides of history modified the same file, then you *might* have
> > conflicts, and you indeed need the blobs to verify.  But if only one
> > side of history modified a file and the other left it alone, then
> > there is no conflict.
>
> I think I probably get it. E.g. the tree of user1's HEAD has a tree
> entry "a4e1fc out/file1", which has the same SHA-1 as the blob in the
> merge base because the file is outside the sparse-checkout
> specification. user1 then fetches a commit from user2 whose tree has
> an entry "13f91e out/file1", so git merge doesn't really need to check
> the contents of the file here, because only one side changed it.

Precisely.  :-)
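
For the archives, the tree-level check can be sketched in Python as
below.  It is a deliberate simplification under stated assumptions:
each tree is modeled as a plain dict from path to blob hash, and
renames, file modes, and directory/file conflicts (all of which a real
merge has to handle) are ignored.  The function name and result labels
are made up for illustration.

```python
def classify_paths(base, ours, theirs):
    """Decide per path whether a three-way merge needs blob contents.

    Each argument maps path -> blob hash, as read from tree objects.
    A missing key means the path does not exist on that side.
    """
    result = {}
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            # Both sides agree (unchanged, or changed identically).
            result[path] = "same on both sides"
        elif o == b:
            # Only their side touched the path; no blob needed to decide.
            result[path] = "take theirs"
        elif t == b:
            # Only our side touched the path.
            result[path] = "keep ours"
        else:
            # Both sides changed it differently; contents must be compared.
            result[path] = "need blobs"
    return result


base = {"out/file1": "a4e1fc"}
ours = {"out/file1": "a4e1fc"}
theirs = {"out/file1": "13f91e"}
print(classify_paths(base, ours, theirs))
```

In your example, out/file1 falls in the "take theirs" bucket, which is
decided from the hashes alone without fetching either blob.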