Re: sparse fetch, was Re: [PATCH 08/12] git-clone: support --path to do sparse clone

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 25 Jul 2008 01:47:03 -0700

Jeff King <peff@xxxxxxxx> writes:

> On Thu, Jul 24, 2008 at 06:41:03PM +0100, Johannes Schindelin wrote:
>
>> > As a user, I would expect "sparse clone" to also be sparse on the 
>> > fetching. That is, to not even bother fetching tree objects that we are 
>> > not going to check out. But that is a whole other can of worms from 
>> > local sparseness, so I think it is worth saving for a different series.
>> 
>> I think this is not even worth of a series.  Sure, it would have benefits 
>> for those who want sparse checkouts.  But it comes for a high price on 
>> everyone else:
>
> I agree there are a lot of issues. I am just thinking of the person who
> said they had a >100G repository. But I am also not volunteering to do
> it, so I will let somebody who really cares about it try to defend the
> idea.

I think sparse fetch is a lot worse than grafts and shallow clones, which
are already bad.  These are all ways to introduce local inconsistency at
the object level and pretend everything is Ok, but the latter two do so
only at commit boundaries, which is somewhat more manageable (though we
still do not handle it very well).  With sparse fetch, you cannot even
guarantee the integrity of individual commits, because subtrees are
missing here and there.
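
To illustrate the last point, here is a toy sketch in Python.  It is not
git's object store or fsck; the store layout, the object ids and the
helper name missing_objects() are all made up for illustration.  It walks
everything reachable from a commit's root tree and reports what is absent
locally.  After a sparse fetch, the walk hits a hole inside a single
commit rather than stopping at a commit boundary.

```python
# Toy object store: id -> ("tree", {name: child_id}) or ("blob", bytes).
# Illustrative model only; this is not git's on-disk format.

def missing_objects(store, root_id):
    """Return the sorted ids reachable from root_id that are absent from store."""
    missing = set()
    seen = set()
    stack = [root_id]
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        obj = store.get(oid)
        if obj is None:
            # We cannot look inside a missing tree, so nothing below it
            # can be verified either.
            missing.add(oid)
            continue
        kind, payload = obj
        if kind == "tree":
            stack.extend(payload.values())
    return sorted(missing)


# Root tree of one hypothetical commit, with src/ and doc/ subtrees.
full = {
    "T0": ("tree", {"src": "T1", "doc": "T2"}),
    "T1": ("tree", {"main.c": "B1"}),
    "T2": ("tree", {"README": "B2"}),
    "B1": ("blob", b"int main(void) { return 0; }\n"),
    "B2": ("blob", b"docs\n"),
}

# A hypothetical sparse fetch that skipped doc/ never transferred T2 or B2.
sparse = {oid: obj for oid, obj in full.items() if oid not in ("T2", "B2")}

print(missing_objects(full, "T0"))
print(missing_objects(sparse, "T0"))
```

The second call reports the doc/ tree as missing, and the blob under it
is not even discoverable, which is the kind of hole within one commit I
mean.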

I do think sparse checkout that says "I'll have the whole tree in the
index but the work tree will have only these paths checked out" makes
sense.  You do not need a fully populated work tree to create commits or
merges -- the only absolute minimum you need is a fully populated index.

In that sense, I think "protect index entries outside of these paths" (I
remember that the first round of this series was done around that notion)
is the wrong mentality for handling this.  We should think of this as more
like "you still populate the index with the whole tree, and you are free
to update its entries in any way you want, but we do not touch the work
tree outside these areas".

This has a few ramifications:

 - If the user can somehow check out a path outside the "sparse" area, it
   is perfectly fine for the user to edit and "git add" it.  Such a method
   to check out a path outside the "sparse" area is a way to widen the
   "sparse" area the user originally set up;

 - When the user runs "merge", and it needs to present the user a working
   tree file because of conflicts at the file level, the user has to agree
   to widen the "sparse" area before being able to do so.  One way to do
   this is to refuse and fail the merge (and then the user needs to do
   that "unspecified way" of widening the "sparse" area first).  Another
   way would be to automatically widen the "sparse" area to include such
   conflicting paths.

 - And you would want to narrow it down after you do such a widening.
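
The model above can be sketched roughly as follows.  This is a toy in
Python, not a proposal for the implementation: the class SparseWorktree
and its method names are hypothetical, and a real tool would have to
check for uncommitted changes before narrowing.

```python
class SparseWorktree:
    """Toy model: the index always holds the whole tree, while the work
    tree holds only paths inside the "sparse" area."""

    def __init__(self, index, sparse_prefixes):
        self.index = dict(index)            # path -> content, whole tree
        self.sparse = set(sparse_prefixes)  # e.g. {"src/"} or exact paths
        self.worktree = {}
        self._checkout()

    def in_sparse(self, path):
        return any(path == p or path.startswith(p.rstrip("/") + "/")
                   for p in self.sparse)

    def _checkout(self):
        # Only materialize paths inside the sparse area; never clobber
        # files that are already present in the work tree.
        for path, content in self.index.items():
            if self.in_sparse(path):
                self.worktree.setdefault(path, content)

    def widen(self, prefix):
        self.sparse.add(prefix)
        self._checkout()

    def narrow(self, prefix):
        # Assumption: a real tool would refuse if there are local edits.
        self.sparse.discard(prefix)
        for path in list(self.worktree):
            if not self.in_sparse(path):
                del self.worktree[path]

    def add(self, path, content):
        # Updating the index is allowed anywhere, inside or outside the area.
        self.index[path] = content
        if self.in_sparse(path):
            self.worktree[path] = content

    def merge_conflicts(self, conflicted, auto_widen):
        """Either fail the merge or widen the area to cover the conflicts."""
        outside = sorted(p for p in conflicted if not self.in_sparse(p))
        if outside and not auto_widen:
            raise RuntimeError("conflicts outside sparse area: %s" % outside)
        for path in outside:
            self.widen(path)
        return outside


wt = SparseWorktree({"src/a.c": "code", "doc/a.txt": "text"}, ["src/"])
wt.merge_conflicts(["doc/a.txt"], auto_widen=True)  # doc/a.txt is checked out
wt.narrow("doc/a.txt")                              # and dropped again later
```

The two policies for "merge" in the list correspond to auto_widen=False
(refuse and fail) and auto_widen=True (widen to include the conflicting
paths).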

For many projects that have src/ and doc/ (git.git being one of them), it
is perfectly valid for a code person and a doc person to work in tandem.
In such a project, after the code person makes changes only to the src/
area in her sparsely checked out repository and pushes the results out,
the doc person would run "git pull && git log -p ORIG_HEAD.." and update
the documentation in his sparsely checked out repository that has only
the doc/ area.  The two parts are tied together and they advance more or
less in sync.  I think sparse checkout would be a useful feature to help
such a configuration.

Having said that, I however think that this can easily be misused as a CVS
style "one CVSROOT houses millions of totally unrelated projects" layout.
In CVS, the layout is perfectly fine because the system does not track
changes at anything higher than the level of individual files, but when
you naïvely map the layout to a system with tree-wide atomic commits, such
as git, it will defeat the whole point of using such a system.  The paces
at which these millions of unrelated projects advance have no relationship
with each other, but by tying them together in the same top-level tree,
the layout introduces an unnecessary ordering between their commits.
