Re: How hard would it be to implement sparse fetching/pulling?

"Philip Oakley" <philipoakley@xxxxxxx> · Sat, 2 Dec 2017 16:59:27 -0000

Hi Jonathan,

Thanks for the outline. It has helped clarify some points and let me see
how closely our thinking aligns.
The one thing I wasn't clear about is the "promised" objects/remote. Is that 
"promisor" remote a fixed entity, or could it be one of many remotes that 
could be a "provider"? (sort of like fetching sub-modules...)
Philip

From: "Jonathan Nieder" <jrnieder@xxxxxxxxx>
Sent: Friday, December 01, 2017 2:51 AM
Hi Vitaly,

Vitaly Arbuzov wrote:

I think it would be great if we agreed at a high level on the desired
user experience, so let me put a few possible use cases here.

I think one thing this thread is pointing to is a lack of overview
documentation about how the 'partial clone' series currently works.
The basic components are:

1. extending git protocol to (1) allow fetching only a subset of the
   objects reachable from the commits being fetched and (2) later,
   going back and fetching the objects that were left out.

   We've also discussed some other protocol changes, e.g. to allow
   obtaining the sizes of un-fetched objects without fetching the
   objects themselves.

2. extending git's on-disk format to allow having some objects not be
   present but only be "promised" to be obtainable from a remote
   repository.  When running a command that requires those objects,
   the user can choose to have it either (a) error out ("airplane
   mode") or (b) fetch the required objects.

   It is still possible to work fully locally in such a repo, make
   changes, get useful results out of "git fsck", etc.  It is kind of
   similar to the existing "shallow clone" feature, except that there
   is a more straightforward way to obtain objects that are outside
   the "shallow" clone when needed on demand.

3. improving everyday commands to require fewer objects.  For
   example, if I run "git log -p", then I want to see the history of
   most files but I don't necessarily want to download large binary
   files just to print 'Binary files differ' for them.

   And by the same token, we might want a mode in which commands
   like "git log -p" default to a particular directory, instead of
   downloading files outside that directory.

   There are some fundamental changes to make in this category ---
   e.g. modifying the index format to not require entries for files
   outside the sparse checkout, to avoid having to download the
   trees for them.
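
To make components 1 and 2 a little more concrete, here is a toy model
in Python. It is only a sketch and is not taken from the patches: the
names ToyRemote, PartialRepo, blob_size_limit and MissingObjectError are
all hypothetical, and a size limit is just one assumed way of choosing
which objects to leave out. It shows a filtered fetch that records the
omitted objects as "promised", followed by either an error ("airplane
mode") or an on-demand fetch when such an object is read.

```python
# Toy model only: every name here is hypothetical, not git's real internals.
from dataclasses import dataclass, field


class MissingObjectError(Exception):
    """Raised in "airplane mode" when a promised object is not present."""


@dataclass
class ToyRemote:
    objects: dict  # object id -> content (bytes)

    def fetch_filtered(self, wanted, blob_size_limit):
        # Component 1, part (1): send only a subset of the wanted objects
        # and report which ids were left out.
        sent = {oid: self.objects[oid] for oid in wanted
                if len(self.objects[oid]) <= blob_size_limit}
        omitted = {oid for oid in wanted if oid not in sent}
        return sent, omitted

    def fetch_one(self, oid):
        # Component 1, part (2): go back later for an omitted object.
        return self.objects[oid]


@dataclass
class PartialRepo:
    remote: ToyRemote
    offline: bool = False                       # True means "airplane mode"
    present: dict = field(default_factory=dict)
    promised: set = field(default_factory=set)  # known ids, content absent

    def clone(self, wanted, blob_size_limit):
        self.present, self.promised = self.remote.fetch_filtered(
            wanted, blob_size_limit)

    def read(self, oid):
        if oid in self.present:
            return self.present[oid]
        if oid not in self.promised:
            raise KeyError(oid)                 # neither present nor promised
        if self.offline:
            raise MissingObjectError(oid)       # choice (a): error out
        data = self.remote.fetch_one(oid)       # choice (b): fetch on demand
        self.present[oid] = data
        self.promised.discard(oid)
        return data


remote = ToyRemote({"a1": b"small file", "b2": b"x" * 1000})
repo = PartialRepo(remote)
repo.clone(["a1", "b2"], blob_size_limit=100)
print(sorted(repo.present), sorted(repo.promised))
```

Reading "a1" works locally, while reading "b2" either raises
MissingObjectError (with offline=True) or triggers a single fetch from
the remote, after which the object is present like any other.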

The overall goal is to make git scale better.

The existing patches do (1) and (2), though it is possible to do more
in those categories. :)  We have plans to work on (3) as well.
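
Since (3) is still only planned, the following is a purely hypothetical
Python sketch of the idea, not a description of any patch. It assumes
that the size of an un-fetched object can be known without its content
(one of the protocol changes mentioned under 1), and it uses a NUL-byte
check as a simple stand-in for binary detection. The names BlobInfo,
describe_change and max_fetch_size, and the 1 MiB threshold, are
invented for illustration.

```python
# Hypothetical sketch: none of these names or thresholds come from git.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BlobInfo:
    oid: str
    size: int                        # assumed known without the content
    content: Optional[bytes] = None  # None means promised, not yet fetched


def describe_change(old: BlobInfo, new: BlobInfo,
                    fetch: Callable[[str], bytes],
                    max_fetch_size: int = 1 << 20) -> str:
    """Return the line a patch view might print for one modified file."""
    blobs = (old, new)
    # Decide from sizes alone before downloading anything.
    if any(b.content is None and b.size > max_fetch_size for b in blobs):
        return "Large files differ (contents not fetched)"
    for b in blobs:
        if b.content is None:
            b.content = fetch(b.oid)  # small enough: fetch on demand
    if b"\0" in old.content or b"\0" in new.content:
        return "Binary files differ"
    return "Text files differ" if old.content != new.content else "No change"


fetched = []  # records which object ids were actually downloaded


def fake_fetch(oid: str) -> bytes:
    fetched.append(oid)
    return {"small-old": b"hello\n"}[oid]


new = BlobInfo("small-new", 6, b"world\n")
small = describe_change(BlobInfo("small-old", 6), new, fake_fetch)
large = describe_change(BlobInfo("big", 50_000_000), new, fake_fetch)
print(small, "|", large, "|", fetched)
```

The point of the sketch is only that the large object is never
requested: the command produces its output from metadata, which is the
kind of saving (3) is after.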

These are overall changes that happen at a fairly low level in git.
They mostly don't require changes command-by-command.

Thanks,
Jonathan