On 11/30/2017 6:43 PM, Philip Oakley wrote:
From: "Vitaly Arbuzov" <vit@xxxxxxxx>
[...]
comments below..
On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
Hey Jeff,
It's great; I didn't expect that anyone was actively working on this.
I'll check out your branch, meanwhile do you have any design docs that
describe these changes or can you define high level goals that you
want to achieve?
On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx>
wrote:
On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
[...]
I have, for separate reasons, been _thinking_ about the issue ($dayjob is
in defence, so a similar partition would be useful).
The changes would almost certainly need to be server side (as well as
client side), as it is the server that decides what is sent over the wire
in the pack files, which would need to be a 'narrow' pack file.
Yes, there will need to be both client and server changes.
In the current 3-part patch series, the client sends a "filter_spec"
to the server as part of the fetch-pack/upload-pack protocol.
If the server chooses to honor it, upload-pack passes the filter_spec
to pack-objects to build an "incomplete" packfile omitting various
objects (currently blobs). Proprietary servers will need similar
changes to support this feature.
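
As a sketch of the client side (the exact option spelling in the posted
patches may differ):

    # partial clone that asks the server to omit all blobs
    git clone --filter=blob:none https://example.com/big-repo.git

    # or omit only blobs larger than 1 MB
    git clone --filter=blob:limit=1m https://example.com/big-repo.git

The filter_spec after --filter= is what gets forwarded to upload-pack and
on to pack-objects on the server.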
Discussing this feature in the context of the defense industry
makes me a little nervous. (I used to be in that area.)
What we have in the code so far may be a nice start, but
probably doesn't have the assurances that you would need
for actual deployment. But it's a start....
If we had such a feature, then all we would need on top is a separate
tool that builds the right "sparse" scope for the workspace based on the
paths that the developer wants to work on.
In a world where more and more companies are moving towards large
monorepos, this improvement would provide a good way of scaling Git to
meet that demand.
The 'companies' problem is that it tends to force a client-server,
always-on, on-line mentality. I'm also wanting the original DVCS off-line
capability to still be available, with _user_ control, in a generic
sense, of what they have locally available (including files/directories
they have not yet looked at, but expect to have). IIUC Jeff's work is
that on-line view, without the off-line capability.
I'd commented early in the series at [1,2,3].
Yes, this does tend to lead towards an always-online mentality.
However, there are 2 parts:
[a] dynamic object fetching for missing objects, such as during a
random command like diff or blame or merge. We need this
regardless of usage -- because we can't always predict (or
dry-run) every command the user might run in advance.
[b] batch fetch mode, such as using partial-fetch to match your
sparse-checkout so that you always have the blobs of interest
to you. And assuming you don't wander outside of this subset
of the tree, you should be able to work offline as usual (there
is a rough sketch of this below).
If you can work within the confines of [b], you wouldn't need to
always be online.
We might also add a part [c] with explicit commands to back-fill or
alter your incomplete view of the ODB (as I explained in response
to the "git diff <commit1> <commit2>" comment later in this thread).
At its core, my idea was to use the object store to hold markers for the
'not yet fetched' objects (mainly trees and blobs). These would be in a
known fixed format, and have the same effect (conceptually) as the
sub-module markers - they _confirm_ the oid, yet say 'not here, try
elsewhere'.
We do have something like this. Jonathan can explain better than I, but
basically, we denote possibly incomplete packfiles from partial clones
and fetches as "promisor" and have special rules in the code to assert
that a missing blob referenced from a "promisor" packfile is OK and can
be fetched later if necessary from the "promising" remote.
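
As a rough sketch of the bookkeeping (names are still settling, so treat
these as illustrative): the marker is per-packfile rather than
per-object, plus a config entry naming the promising remote:

    # the repo records which remote made the promise
    git config extensions.partialClone origin

    # a pack fetched from that remote gets an empty marker file next to
    # it (the pack name below is just an example)
    .git/objects/pack/pack-1234abcd.pack
    .git/objects/pack/pack-1234abcd.idx
    .git/objects/pack/pack-1234abcd.promisor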
The main problem with markers or other lists of missing objects is
that it has scale problems for large repos. Suppose I have 100M
blobs in my repo. If I do a blob:none clone, I'd have 100M missing
blobs that would need tracking. If I then do a batch fetch of the
blobs needed to do a sparse checkout of HEAD, I'd have to remove
those entries from the tracking data. Not impossible, but not
speedy either.
The comparison with submodules means there is the same chance of
de-synchronisation with triangular and upstream servers, unless managed.
The server side, as noted, will need to be included as it is the one that
decides the pack file.
Options for server management are:
- "I accept narrow packs?" No; yes.
- "I serve narrow packs?" No; yes.
- "Repo completeness checks on receipt": (must be complete) || (allow
  narrow to nothing).
We have new config settings for the server to allow/reject partial
clones, and we have code in fsck/gc to handle these incomplete packfiles.
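
As a sketch (again, the final spelling may differ), the server-side knob
looks something like:

    # on the serving repo: let upload-pack honor the client's filter_spec
    git config uploadpack.allowFilter true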
For server farms (e.g. GitHub) the settings could be global, or per repo.
(Note that the completeness requirement and the narrow-receipt option are
not incompatible - the recipient server can reject the pack from a narrow
subordinate as incomplete - see below.)
For now our scope is limited to partial clone and fetch. We've not
considered push.
* Marking of 'missing' objects in the local object store, and on the wire.
The missing objects are replaced by a placeholder object, which uses the
same oid/sha1, but has a short fixed length, with content "GitNarrowObject
<oid>". The chance that that string would actually produce such an oid
clash is the same as for all other object hashes, so it is a *safe*
self-referential device.
Again, there is a scale problem here. If I have 100M missing blobs,
I can't afford to create 100M loose placeholder files. Or juggle
a 2GB file of missing objects (100M oids at 20 bytes each) on various
operations.
* The stored object already includes length (and inferred type), so we
do know what it stands in for. Thus the local index (index file) should
be able to be recreated from the object store alone (including the
'promised / narrow / missing' files/directory markers).
* The 'same' as sub-modules.
The potential for loss of synchronisation with a golden complete repo is
just the same as for sub-modules. (We expected object/commit X here, but
it’s not in the store). This could happen with a small user group who
have locally narrow clones, who interact with their local narrow server
for ‘backup’, and then fail to push further upstream to a server that
mandates completeness. They could create a death by a thousand narrow
cuts. Having a golden upstream config reference (indicating which is the
upstream) could allow checks to ensure that doesn’t happen.
The fsck can be taught the config option of 'allowNarrow'.
We've updated fsck to be aware of partial clones.
The narrowness would be defined in a locally stored '.gitNarrowIgnore'
file (which can include the size constraints being developed elsewhere on
the list).
As a safety measure, the .gitNarrowIgnore could be sent with the pack so
that folk know what they missed, and fsck could check that they are
locally not narrower than some specific project .gitNarrowIgnore spec.
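
(Purely as an illustration of the idea -- no such file exists today, and
the syntax below is invented -- it might borrow gitignore/sparse-checkout
pattern style plus a size limit:)

    # .gitNarrowIgnore -- hypothetical, syntax invented for illustration
    /assets/video/           # whole directories never fetched locally
    /third_party/prebuilt/
    blob-size-limit = 10M    # made-up knob, in the spirit of the size
                             # constraints being discussed on the list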
Currently, we store the filter_spec used for the partial clone (or the
first partial fetch) as a default for subsequent fetches, but we limit
it there. That is, for operations like checkout or blame or whatever,
it doesn't matter why a blob is missing or what filter criteria were
used to cause it to be omitted -- just that it is missing. Any time we need
a missing object, we have to go get it -- whether that is via dynamic
or bulk fetching.
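
Concretely, the remembered filter would just be a per-remote config entry
that later fetches pick up by default (a sketch; the exact key name may
differ in the posted patches):

    # recorded when the partial clone / first partial fetch is made
    git config remote.origin.partialclonefilter blob:none

    # later fetches from origin apply the same filter by default
    git fetch origin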
The benefit of this is that the off-line operation capability of Git
continues, which GVFS doesn't quite do (accidental lock-in to a
client-server model, aka all those other VCS systems).
Yes, there are limitations of GVFS and you must be online to use it.
It essentially does a blob:none filter and dynamically faults in every
blob. (But it is also using a kernel driver and daemon to dynamically
populate the file system when a file is first opened for writing --
much like copy-on-write virtual memory. And yes, these 2 operations
are independent, but currently combined in the GVFS code.)
And yes, there is work here to make sure that most normal
operations can continue to work offline.
I believe that it's all doable, and that Jeff H's work already puts much
of it in place, or touches those places.
That said, it has been just _thinking_, without sufficient time to delve
into the code.
Phil
[...]
Thanks
Jeff