Re: [PATCH 04/10] sparse-checkout: allow in-tree definitions

Elijah Newren <newren@xxxxxxxxx> · Wed, 17 Jun 2020 16:07:01 -0700

On Wed, May 20, 2020 at 10:52 AM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Fri, May 8, 2020 at 8:42 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
> >
> > On 5/7/2020 6:58 PM, Junio C Hamano wrote:
> > > "Derrick Stolee via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> > >
> > >> One of the difficulties of using the sparse-checkout feature is not
> > >> knowing which directories are absolutely needed for working in a portion
> > >> of the repository. Some of this can be documented in README files or
> > >> included in a bootstrapping tool along with the repository. This is done
> > >> in an ad-hoc way by every project that wants to use it.
> > >>
> > >> Let's make this process easier for users by creating a way to define a
> > >> useful sparse-checkout definition inside the Git tree data. This has
> > >> several benefits. In particular, the data is available to anyone who has
> > >> a copy of the repository without needing a different data source.
> > >> Second, the needs of the repository can change over time and Git can
> > >> present a way to automatically update the working directory as these
> > >> sparse-checkout definitions change over time.
> > >
> > > And two lines of development can merge them together?
> > >
> > > Any time a new "feature" pops up that would eventually affect how
> > > "git clone" and "git checkout" work based on untrusted user data, we
> > > need to make sure there are no negative security implications.
> > >
> > > If it only boils down to "we have files that can record a list of
> > > leading directory names, without offering extra 'flexibility'", I
> > > guess there isn't all that much that a malicious sparse definition
> > > can do and we would be safe, though.
> >
> > Yes. I hope that we can be extremely careful with this feature.
> > The RFC status of this series implicitly includes the question
> > "Should we do this at all?" I think the benefits outweigh the
> > risks, but we can minimize those risks with very careful design
> > and implementation.
> >
> > >> To use this feature, add the "--in-tree" option when setting or adding
> > >> directories to the sparse-checkout definition. For example:
> > >>
> > >>   $ git sparse-checkout set --in-tree .sparse/base
> > >>   $ git sparse-checkout add --in-tree .sparse/extra
> > >>
> > >> These commands add values to the multi-valued config setting
> > >> "sparse.inTree". When updating the sparse-checkout definition, these
> > >> values describe paths in the repository to find the sparse-checkout
> > >> data. After the commands listed earlier, we expect to see the following
> > >> in .git/config.worktree:
> > >>
> > >>      [sparse]
> > >>              intree = .sparse/base
> > >>              intree = .sparse/extra
> > >
> > > What does this say in human words?  "These two tracked files specify
> > > which paths should be in the working tree"?  Spelling it out here
> > > would help readers of this commit.
> >
> > You got it. Sounds good.
> >
> > >> When applying the sparse-checkout definitions from this config, the
> > >> blobs at HEAD:.sparse/base and HEAD:.sparse/extra are loaded.
> > >
> > > OK, so end-user edit to the working tree copy or what is added to
> > > the index does not count and only the committed version gets used.
> > >
> > > That makes it simple---I was wondering how we would operate when
> > > merging a branch with different contents in the .sparse/* files
> > > until the conflicts are resolved.
> >
> > It's worth testing this case so we can be sure what happens.
>
> During a merge or rebase or checkout -m, what happens if .sparse/extra
> has the following working tree content:
>
> [sparse]
>     dir = D
>     dir = X
> <<<<<< HEAD
>     dir = Y
> |||||| MERGE_BASE
> ======
>     inherit = .sparse/tools
> >>>>>>  MERGE_HEAD
>     inherit = .sparse/base
>
> and, of course, three different entries in the index?
>
> Also, do we use the version of the --in-tree file from the latest
> commit, from the index, or from the working tree?  (This is a question
> not only for merge and rebase, but also checkout with dirty changes
> and even checkout -m.)  Which one "wins"?
>
> And what if the user updates and commits an ill-formed version of the
> file -- is it equivalent to getting an empty cone with just the
> toplevel directory, equivalent to getting a complete checkout of
> everything, or something else?

Son pointed out that Mercurial has a 'sparse' extension with some
ideas we could borrow here; see
https://lore.kernel.org/git/CABPp-BGLBmWXrmPsTogyBFMgwYbHjN39oWbU=qDWroU1_fJaoQ@xxxxxxxxxxxxxx/
for some further discussion.
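
For anyone who wants to poke at the "which one wins?" question quoted
above, here is a rough sketch that reproduces the conflicted state in a
throwaway repository and reads each candidate version of the file. It
uses only existing Git commands and deliberately does not use the
proposed --in-tree option. The branch name "side" and the temporary
directory are made up for illustration, and the file contents are
borrowed from the example above.

```shell
#!/bin/sh
# Sketch: reproduce a conflicted .sparse/extra and read every candidate
# version (commit, index stages, working tree). Illustrative only.
set -e

repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

# Common ancestor of the two lines of development.
mkdir .sparse
printf '[sparse]\n\tdir = D\n\tdir = X\n' > .sparse/extra
git add .sparse/extra
git commit -q -m "base"
start=$(git symbolic-ref --short HEAD)

# Hypothetical side branch that adds an "inherit" line.
git checkout -q -b side
printf '[sparse]\n\tdir = D\n\tdir = X\n\tinherit = .sparse/tools\n' > .sparse/extra
git commit -q -am "side: inherit tools"

# The starting branch adds a different line in the same place.
git checkout -q "$start"
printf '[sparse]\n\tdir = D\n\tdir = X\n\tdir = Y\n' > .sparse/extra
git commit -q -am "start: add Y"

# The merge is expected to conflict, so do not let `set -e` stop here.
git merge side >/dev/null 2>&1 || true

committed=$(git show HEAD:.sparse/extra)     # latest commit
stage_base=$(git show :1:.sparse/extra)      # index stage 1: merge base
stage_ours=$(git show :2:.sparse/extra)      # index stage 2: ours
stage_theirs=$(git show :3:.sparse/extra)    # index stage 3: theirs
worktree=$(cat .sparse/extra)                # file with conflict markers

printf '%s\n' "--- HEAD ---" "$committed" "--- worktree ---" "$worktree"
```

If the definition is read from HEAD, as the commit message describes,
the conflicted working tree file and the three index stages are never
parsed during the merge; whether that is the behavior we want after the
conflict is resolved but before it is committed is still the open
question.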