Re: How hard would it be to implement sparse fetching/pulling?

Jonathan Nieder <jrnieder@xxxxxxxxx> · Thu, 30 Nov 2017 18:51:06 -0800

Hi Vitaly,

Vitaly Arbuzov wrote:

> I think it would be great if we high level agree on desired user
> experience, so let me put a few possible use cases here.

I think one thing this thread is pointing to is a lack of overview
documentation about how the 'partial clone' series currently works.
The basic components are:

 1. extending the git protocol to (1) allow fetching only a subset of the
    objects reachable from the commits being fetched and (2) later,
    going back and fetching the objects that were left out.

    We've also discussed some other protocol changes, e.g. to allow
    obtaining the sizes of un-fetched objects without fetching the
    objects themselves.

 2. extending git's on-disk format to allow having some objects not be
    present but only be "promised" to be obtainable from a remote
    repository.  When running a command that requires those objects,
    the user can choose to have it either (a) error out ("airplane
    mode") or (b) fetch the required objects.

    It is still possible to work fully locally in such a repo, make
    changes, get useful results out of "git fsck", etc.  It is similar
    to the existing "shallow clone" feature, except that there is a
    more straightforward way to obtain objects that are outside the
    "shallow" clone on demand.

 3. improving everyday commands to require fewer objects.  For
    example, if I run "git log -p", then I want to see the history of
    most files but I don't necessarily want to download large binary
    files just to print 'Binary files differ' for them.

    By the same token, we might want commands like "git log -p" to
    have a mode that restricts them to a particular directory by
    default, instead of downloading files outside that directory.

    There are some fundamental changes to make in this category ---
    e.g. modifying the index format to not require entries for files
    outside the sparse checkout, to avoid having to download the
    trees for them.
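
To make (2) a bit more concrete, here is a toy model in Python of an
object store where some objects are only "promised".  This is a sketch
for illustration, not how git implements it: the class, the
fetch_from_remote callback, and the object ids are all made up, and
real git stores this information on disk rather than in memory.

```python
# Toy model only: none of these names exist in git itself.
from typing import Callable, Dict, Iterable, List, Set


class MissingObjectError(Exception):
    """A promised object is needed but fetching is disabled."""


class PartialObjectStore:
    def __init__(
        self,
        local: Dict[str, bytes],
        promised: Set[str],
        fetch_from_remote: Callable[[str], bytes],
        offline: bool = False,
    ) -> None:
        self.local = dict(local)        # objects actually present
        self.promised = set(promised)   # absent, but obtainable from the remote
        self.fetch_from_remote = fetch_from_remote  # hypothetical callback
        self.offline = offline          # "airplane mode"

    def read(self, oid: str) -> bytes:
        """Return an object, fetching it lazily if it was only promised."""
        if oid in self.local:
            return self.local[oid]
        if oid not in self.promised:
            # Neither present nor promised: the repository is corrupt.
            raise KeyError(f"object {oid} is missing and was never promised")
        if self.offline:
            # Choice (a): error out instead of contacting the remote.
            raise MissingObjectError(f"object {oid} is not available offline")
        # Choice (b): fetch the object on demand and keep it locally.
        data = self.fetch_from_remote(oid)
        self.local[oid] = data
        self.promised.discard(oid)
        return data

    def check(self, referenced: Iterable[str]) -> List[str]:
        """fsck-like check: report ids that are neither local nor promised."""
        return [
            oid for oid in referenced
            if oid not in self.local and oid not in self.promised
        ]


if __name__ == "__main__":
    remote = {"small-blob": b"source text", "big-blob": b"large binary"}
    store = PartialObjectStore(
        local={"small-blob": remote["small-blob"]},
        promised={"big-blob"},
        fetch_from_remote=remote.__getitem__,
    )
    print(store.check(["small-blob", "big-blob", "lost-blob"]))
    print(store.read("big-blob"))
```

In this model the lazy fetch lives entirely inside read(), so callers
do not need to know whether an object was present or only promised.
The check() method shows why a consistency check can still be useful:
a promised object is expected to be absent, while an object that is
neither present nor promised indicates a real problem.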

The overall goal is to make git scale better.

The existing patches do (1) and (2), though it is possible to do more
in those categories. :)  We have plans to work on (3) as well.

These are overall changes that happen at a fairly low level in git.
They mostly don't require changes command-by-command.

Thanks,
Jonathan