Re: Sparse clones (Was: Re: [PATCH 1/2] upload-pack: support subtree packing)

Elijah Newren <newren@xxxxxxxxx> · Tue, 27 Jul 2010 21:31:55 -0600

2010/7/27 Avery Pennarun <apenwarr@xxxxxxxxx>:
> Note that if you happen to want to implement it in a way that you'll
> also get all the commit objects from your submodules too (which I
> highly encourage :)) then downloading the trees is the easiest way.
> Otherwise you won't know which submodule commits you need.

Makes sense.  Seems like a good reason to include all the trees.

> Since downloading commits is so cheap anyway, I'd suggest just
> defaulting to downloading all the refs, as clone currently does.  If
> people don't like it, they can do what they currently do:
>
>   git init
>   git remote add ...
>   git fetch
>
> Not that pretty, but then again, it's rarely needed.

Would you suggest then parsing the limiting arguments passed to clone
and disallowing refs?  Or just making any refs given there pointless by
always appending "--all HEAD"?

>> 2) Sparse checkouts are automatically invoked with the path(s) from
>>   the specified rev-list arguments.
<snip>
> I don't totally understand what you mean here.  But I do think that if

Basically, I mean what you stated much more succinctly and eloquently
right here:

> I guess my point is, more complex exclusions could always be added
> later but they aren't so important right away.

>> 4) All revision-walking operations automatically use these limiting args.
<snip>
> It does sound sort of elegant: this way they *won't* run into the
> missing objects.  Beware, however, that
>
>   git log -- Documentation
>
> outputs a different set of commits than just
>
>   git log

Yes, exactly.  In a sparse clone, why wouldn't users want the behavior
of the former automatically, without having to specify the paths on
the command line every time they run log (or rev-list, fast-export,
etc.), especially if they cloned N directories rather than just one?

Actually, I can kind of see the desire to see the 'real' log, since
users do happen to have all commits locally, but that seems like it
should be the case that requires passing a special option to git log
('--ignore-sparse-limiting'?).  But getting that option to work in
conjunction with other options (--stat, -S, -p, etc.) would be really
hard, if not impossible.
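
To make the difference between the two log invocations concrete, here
is a minimal Python sketch of a path-limited walk over a linear
history.  It is only a model, not how git implements it: the Commit
class, the log() function and the sample history are all hypothetical,
and it ignores the history simplification that real git applies around
merges.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for a commit: a name plus the paths it touched."""
    name: str
    paths: Tuple[str, ...]


def touches(commit: Commit, limits: Sequence[str]) -> bool:
    """True if the commit changed a path equal to or inside any limit."""
    for limit in limits:
        limit = limit.rstrip("/")
        for path in commit.paths:
            if path == limit or path.startswith(limit + "/"):
                return True
    return False


def log(history: Sequence[Commit], limits: Sequence[str] = ()) -> list:
    """Model of 'git log [-- <limits>]' on a linear history, newest first."""
    if not limits:
        return [c.name for c in history]
    return [c.name for c in history if touches(c, limits)]


# Assumed sample history, newest commit first.
HISTORY = [
    Commit("c4", ("Documentation/git-log.txt",)),
    Commit("c3", ("builtin/log.c",)),
    Commit("c2", ("Documentation/Makefile", "Makefile")),
    Commit("c1", ("README",)),
]

print(log(HISTORY))                      # ['c4', 'c3', 'c2', 'c1']
print(log(HISTORY, ("Documentation",)))  # ['c4', 'c2']
```

Under the proposal, a bare log in a clone limited to Documentation
would behave like the second call without the user typing the path.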

>> 5) "Densifying" a sparse clone can be done
<snip>
> I think this would work, but unless you want to re-download some
> (possibly lots of) objects you've already got, it would require some
> kind of extra support from the server, I think.  Maybe that's a rare
> enough case that few people will care and it could be fixed later.

For my first implementation, my plan was to simply re-download ALL
(not just some or lots of) objects I've already got in such a case.  A
bit wasteful to be sure, but I was hoping it was rare enough to
"densify" a clone that it wouldn't be a big deal...and that support
for smarter downloads could be added later.

> I don't think the pull vs. fetch distinction is valid; I would be very
> surprised if pull un-sparsified my checkout, just as I would be
> surprised if merge did.  And pull is just fetch+merge.

Right, I don't think pull should un-sparsify either the checkout OR
the clone by default (it should have fetch pass the same limiting
arguments and only download an equivalently sparse set of updates).
Your point about pull=fetch+merge (or fetch+rebase) makes sense, which
I guess means that un-sparsifying a clone+checkout should be a
separate toplevel command ("densify"?) rather than a special option
for fetch/pull.

>> 6) Cloning-from/fetching-from/pushing-to sparse clones is supported.
>>
>> Future fetches and pushes also make use of the limiting arguments.
>> Receives do as well, but only to make sure the pack obtained is not
>> "more sparse" than what the receiving repository already has.
>> (uploads ignore the stored rev-list arguments, instead using the
>> rev-list arguments passed to it -- it will die if asked for content
>> not locally available to it.)
>
> This scares me a little.  It's a reminder that it's all-too-easy to
> get your repository into a really messed up state by going in and
> screwing with your sparseness parameters at the wrong time.

I don't follow.  Why would people be "screwing with sparseness parameters"?

My basic idea was that there would be only three ways to change
sparseness parameters for clones, with only the first two documented:
the initial clone command, the "densify" command (someone probably
needs to think of a better name), and reading the source code to
figure out what bits on your disk to change and changing them.

Here's why I want the clone-able/fetch-able/pull-able sparse clone
functionality:

I would like translators (who may need only one file), technical
writers (who need only the Documentation/ subdirectory), and similar
folks to be able to collaborate on just the subset of the repository
that they need for their work.  Thus, it makes sense for them to be
able to clone from, pull from, and push to each other.  I think only
two rules are necessary to enable such behavior:

* No repository can provide information that it doesn't have (should
  be pretty easy to enforce...)
* No repository accepts less data than it expects in its repository
  (i.e. you can push to a sparse clone or a real clone, but need to
  provide data that fulfills its rev-list limiting arguments)
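
The two rules above can be sketched as containment checks.  This is a
rough Python model under loud assumptions: sparseness is reduced to a
list of path limits (real rev-list arguments can also restrict
revisions, which is ignored here), the empty string stands for a full
clone, and the function names are hypothetical rather than anything in
git.

```python
from typing import Iterable


def _norm(path: str) -> str:
    """Normalize a path limit; the empty string means the whole tree."""
    return path.strip("/")


def is_under(path: str, prefix: str) -> bool:
    """True if `path` equals `prefix` or lies inside the directory `prefix`."""
    path, prefix = _norm(path), _norm(prefix)
    return prefix == "" or path == prefix or path.startswith(prefix + "/")


def covers(have: Iterable[str], want: Iterable[str]) -> bool:
    """True if every limit in `want` falls inside some limit in `have`."""
    have = list(have)
    return all(any(is_under(w, h) for h in have) for w in want)


def can_upload(local_limits: Iterable[str], requested: Iterable[str]) -> bool:
    """Rule 1: a repository cannot provide data that it doesn't have."""
    return covers(local_limits, requested)


def can_receive(local_limits: Iterable[str], pack_limits: Iterable[str]) -> bool:
    """Rule 2: an incoming pack must not be sparser than the receiver."""
    return covers(pack_limits, local_limits)


FULL = [""]               # a real (non-sparse) clone
DOCS = ["Documentation"]  # a technical writer's sparse clone

print(can_upload(FULL, DOCS))   # True: a full clone can serve the docs subset
print(can_upload(DOCS, FULL))   # False: a sparse clone cannot serve everything
print(can_receive(DOCS, FULL))  # True: a full pack satisfies a sparse receiver
print(can_receive(FULL, DOCS))  # False: a docs-only pack is too sparse
```

Two equally sparse clones pass both checks in either direction, which
is the translator-to-translator case described above.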

> It would make me more comfortable if there was some kind of "oh god,
> just fix it by downloading any objects you think are missing" mode :)
> In fact, git could benefit from that in general - every now and then
> someone on the list asks about a repository they managed to mangle by
> corrupting a pack or something, and there's no really good answer to
> that.

For sparse clones, isn't that mode just running the "densify" command
with no limiting arguments?

>> 7) Operations that need unavailable data simply error out
>>
>> Examples: merge, cherry-pick, rebase (and upload-pack in a sparse
>> clone).  However, hopefully the error messages state what extra
>> information needs to be downloaded so the user can appropriately
>> "densify" their repository.
>
> That sounds good to me.

Thanks for the detailed feedback.  :-)