Re: How hard would it be to implement sparse fetching/pulling?

Jonathan Nieder <jrnieder@xxxxxxxxx> · Fri, 1 Dec 2017 10:24:46 -0800

Jeff Hostetler wrote:
> On 11/30/2017 6:43 PM, Philip Oakley wrote:

>> The 'companies' problem is that it tends to force a client-server, always-on
>> on-line mentality. I'm also wanting the original DVCS off-line capability to
>> still be available, with _user_ control, in a generic sense, of what they
>> have locally available (including files/directories they have not yet looked
>> at, but expect to have. IIUC Jeff's work is that on-line view, without the
>> off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>
> Yes, this does tend to lead towards an always-online mentality.
> However, there are 2 parts:
> [a] dynamic object fetching for missing objects, such as during a
>     random command like diff or blame or merge.  We need this
>     regardless of usage -- because we can't always predict (or
>     dry-run) every command the user might run in advance.
> [b] batch fetch mode, such as using partial-fetch to match your
>     sparse-checkout so that you always have the blobs of interest
>     to you.  And assuming you don't wander outside of this subset
>     of the tree, you should be able to work offline as usual.
> If you can work within the confines of [b], you wouldn't need to
> always be online.

Just to amplify this: for our internal use we care a lot about
disconnected usage working.  So it is not like we have forgotten about
this use case.

> We might also add a part [c] with explicit commands to back-fill or
> alter your incomplete view of the ODB

Agreed, this will be a nice thing to add.

[...]
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>> known fixed format, and have the same effect (conceptually) as the
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>> elsewhere'.
>
> We do have something like this.  Jonathan can explain better than I, but
> basically, we denote possibly incomplete packfiles from partial clones
> and fetches as "promisor" and have special rules in the code to assert
> that a missing blob referenced from a "promisor" packfile is OK and can
> be fetched later if necessary from the "promising" remote.
>
> The main problem with markers or other lists of missing objects is
> that it has scale problems for large repos.

Any chance that we can get a design doc in Documentation/technical/
giving an overview of the design, with a brief "alternatives
considered" section describing this kind of thing?

E.g. some of the earlier descriptions like
 https://public-inbox.org/git/20170915134343.3814dc38@xxxxxxxxxxxxxxxxxxxxxxxxxxx/
 https://public-inbox.org/git/cover.1506714999.git.jonathantanmy@xxxxxxxxxx/
 https://public-inbox.org/git/20170113155253.1644-1-benpeart@xxxxxxxxxxxxx/
may help as a starting point.

Thanks,
Jonathan