Re: How hard would it be to implement sparse fetching/pulling?

I think it would be great if we first agree at a high level on the
desired user experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has many folders inside src,
including "bar".
Actions:
git init foo && cd foo
git config core.sparseAll true # New flag that activates sparse
behavior for all operations by default, so you don't need to pass an
option to each command.
echo "src/bar" > .git/info/sparse-checkout
git remote add origin blah
git pull origin master
Expected results: foo contains the src/bar folder and nothing else;
objects unrelated to this tree are not fetched.
Notes: this should work the same when the fetch/merge/checkout
operations are used separately in the right order.
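For example, with the same hypothetical core.sparseAll flag set, the
split form would be:
git fetch origin master
git merge FETCH_HEAD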

2. Add a file and push changes.
Preconditions: all steps above are followed.
Actions:
touch src/bar/baz.txt && git add -A && git commit -m "added a file"
git push origin master
Expected results: changes are pushed to remote.

3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
echo "src/bar" > /tmp/blah-sparse-checkout
git clone --sparse /tmp/blah-sparse-checkout blah # Clone should be
the only command that requires a specific option to be passed.
Expected results: same as for #1, plus /tmp/blah-sparse-checkout is
copied into .git/info/sparse-checkout.
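A quick check of the expected state (assuming the hypothetical
--sparse option above):
cd blah
cat .git/info/sparse-checkout # should print: src/bar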

4. Showing log for sparsely cloned repo.
Preconditions: #3 is followed
Actions:
git log
Expected results: recent changes that affect the src/bar tree.
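Conceptually this is what today's path-limited log already shows,
except that unrelated objects would not need to be present locally:
git log -- src/bar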

5. Showing diff.
Preconditions: #3 is followed
Actions:
git diff HEAD^ HEAD
Expected results: changes from the most recent commit that affect the
src/bar folder are shown.
Notes: this can be a tricky operation, as filtering must be done to
remove results from unrelated subtrees.
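The output should match what today's path-limited diff gives:
git diff HEAD^ HEAD -- src/bar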

*Note that I intentionally didn't mention use cases related to
filtering by blob size, as I think we should consider them a separate,
although related, feature.

What do you think about the examples above? Is this something that
more-or-less fits into current development? Are there other important
flows that I've missed?

-Vitaly

On Thu, Nov 30, 2017 at 5:27 PM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
> Jonathan, thanks for the references, that is super helpful; I will
> follow your suggestions.
>
> Philip, I agree that keeping the original DVCS off-line capability is an
> important point. Ideally this feature should work even with remotes
> that are located on the local disk.
> Which part of Jeff's work do you think wouldn't work offline once repo
> initialization is done and a sparse fetch has been performed? All the
> stuff that I've seen seems quite usable without GVFS.
> I'm not sure whether we need to store markers/tombstones on the client;
> what problem does that solve?
>
> On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley <philipoakley@xxxxxxx> wrote:
>> From: "Vitaly Arbuzov" <vit@xxxxxxxx>
>>>
>>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>>
>>> Looking at the commits, I see that you've done a lot of work already,
>>> including packing, filtering, fetching, cloning, etc.
>>> What are some areas that aren't complete yet? Do you need any help
>>> with implementation?
>>>
>>
>> comments below..
>>
>>>
>>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
>>>>
>>>> Hey Jeff,
>>>>
>>>> That's great, I didn't expect that anyone was actively working on this.
>>>> I'll check out your branch; meanwhile, do you have any design docs that
>>>> describe these changes, or can you define the high-level goals that you
>>>> want to achieve?
>>>>
>>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>>>>
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>>>>> (mono)repositories with unrelated source trees (that span across
>>>>>> multiple services).
>>>>>> I've found the sparse checkout approach appealing and helpful for most
>>>>>> client-side operations (e.g. status, reset, commit, etc.).
>>>>>> The problem is that there is no sparse fetch/pull feature in git,
>>>>>> which means that ALL objects in unrelated trees are always fetched.
>>>>>> This can take a lot of time for large repositories and results in
>>>>>> practical scalability limits for git.
>>>>>> This has forced some large companies like Facebook and Google to move
>>>>>> to Mercurial, as they were unable to improve the client-side
>>>>>> experience with git, while Microsoft has developed GVFS, which seems
>>>>>> to be a step back toward the CVCS world.
>>>>>>
>>>>>> I want to get feedback (from git users more experienced than I am)
>>>>>> on what it would take to implement sparse fetching/pulling
>>>>>> (downloading only objects related to the sparse-checkout list).
>>>>>> Are there any issues with missing hashes?
>>>>>> Are there any fundamental problems why it can't be done?
>>>>>> Could we get away with only client-side changes, or would it require
>>>>>> special features on the server side?
>>>>>>
>>
>> I have, for separate reasons, been _thinking_ about the issue ($dayjob is
>> in defence, so a similar partition would be useful).
>>
>> The changes would almost certainly need to be server side (as well as client
>> side), as it is the server that decides what is sent over the wire in the
>> pack files, which would need to be a 'narrow' pack file.
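>>
>> As a purely illustrative sketch (these request lines are hypothetical,
>> not what Jeff's series actually sends), the fetch request would need to
>> carry the narrow spec so the server can build such a pack:
>>
>>     want <oid-of-commit>
>>     filter sparse:oid=<oid-of-narrow-spec>
>>     done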
>>
>>>>>> If we had such a feature then all we would need on top is a separate
>>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>>> paths that developer wants to work on.
>>>>>>
>>>>>> In the world where more and more companies are moving towards large
>>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>>> meet this demand.
>>
>>
>> The 'companies' problem is that it tends to force a client-server,
>> always-on, on-line mentality. I also want the original DVCS off-line
>> capability to still be available, with _user_ control, in a generic
>> sense, of what they have locally available (including files/directories
>> they have not yet looked at, but expect to have). IIUC Jeff's work gives
>> that on-line view, without the off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>>
>>
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>> known fixed format, and have the same effect (conceptually) as the
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>> elsewhere'.
>>
>> The comparison with submodules means there is the same chance of
>> de-synchronisation with triangular and upstream servers, unless managed.
>>
>> The server side, as noted, will need to be included as it is the one that
>> decides the pack file.
>>
>> Options for server management are:
>>
>> - "I accept narrow packs?" No; yes.
>>
>> - "I serve narrow packs?" No; yes.
>>
>> - "Repo completeness checks on receipt": (must be complete) || (allow
>> narrow to nothing).
>>
>> For server farms (e.g. GitHub...) the settings could be global, or per
>> repo. (Note that the completeness requirement and the narrow-receipt
>> option are not incompatible - the recipient server can reject the pack
>> from a narrow subordinate as incomplete - see below.)
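>>
>> A hypothetical config sketch for the above (all names invented for
>> illustration):
>>
>>     [narrow]
>>         acceptPacks = true       # "I accept narrow packs?"
>>         servePacks = true        # "I serve narrow packs?"
>>         requireComplete = true   # completeness check on receipt
>>     [fsck]
>>         allowNarrow = true       # proposed fsck option, see below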
>>
>> * Marking of 'missing' objects in the local object store, and on the wire.
>> The missing objects are replaced by a placeholder object, which uses the
>> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
>> <oid>”. The chance that that string would actually have such an oid clash
>> is the same as for all other object hashes, so it is a *safe*
>> self-referential device.
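>>
>> Purely for illustration (no such behaviour exists today), reading such
>> a marker back might look like:
>>
>>     $ git cat-file -p <oid>
>>     GitNarrowObject <oid>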
>>
>>
>> * The stored object already includes its length (and inferred type), so
>> we do know what it stands in for. Thus the local index (index file)
>> should be able to be recreated from the object store alone (including the
>> ‘promised / narrow / missing’ file/directory markers).
>>
>> * It is the ‘same’ as sub-modules.
>> The potential for loss of synchronisation with a golden complete repo is
>> just the same as for sub-modules. (We expected object/commit X here, but
>> it’s not in the store.) This could happen with a small user group who have
>> locally narrow clones, who interact with their local narrow server for
>> ‘backup’, and then fail to push further upstream to a server that mandates
>> completeness. They could create death by a thousand narrow cuts. Having a
>> golden upstream config reference (indicating which server is the upstream)
>> could allow checks to ensure that doesn’t happen.
>>
>> fsck could be taught a config option such as 'allowNarrow' (as in the
>> sketch above).
>>
>> The narrowness would be defined in a locally stored '.gitNarrowIgnore'
>> file (which could include the size constraints being developed elsewhere
>> on the list).
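>>
>> A hypothetical .gitNarrowIgnore, with the syntax invented for
>> illustration (it could borrow from .gitignore, plus size limits):
>>
>>     # directories not wanted locally
>>     docs/
>>     src/other-service/
>>     # size constraint, per the filtering work elsewhere on the list
>>     blob:limit=10m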
>>
>> As a safety measure, the .gitNarrowIgnore could be sent with the pack so
>> that folk know what they missed, and fsck could check that they are
>> locally not narrower than some specific project .gitNarrowIgnore spec.
>>
>> The benefit of this is that the off-line operation capability of Git
>> continues, which GVFS doesn’t quite do (accidental lock-in to a
>> client-server model, aka all those other VCS systems).
>>
>> I believe that it's all doable, and that Jeff H's work already puts much
>> of it in place, or touches those places.
>>
>> That said, it has been just _thinking_, without sufficient time to delve
>> into the code.
>>
>> Phil
>>
>>>>>>
>>>>>> PS. Please don't advise splitting things up, as there are some good
>>>>>> reasons why many companies decide to keep their code in a monorepo,
>>>>>> which you can easily find online. So let's keep that part out of
>>>>>> scope.
>>>>>>
>>>>>> -Vitaly
>>>>>>
>>>>>
>>>>>
>>>>> This work is in progress now.  A short summary of the current
>>>>> parts 1, 2, and 3 can be found in [1].
>>>>>
>>>>>> * jh/object-filtering (2017-11-22) 6 commits
>>>>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>>>>> * jh/partial-clone (2017-11-22) 14 commits
>>>>>
>>>>>
>>>>>
>>>>> [1]
>>>>>
>>>>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/T/
>>>>>
>>>>> I have a branch that contains V5 of all 3 parts:
>>>>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>>>>
>>>>> This is a WIP, so there are some rough edges....
>>>>> I hope to have a V6 out before the weekend with some
>>>>> bug fixes and cleanup.
>>>>>
>>>>> Please give it a try and see if it fits your needs.
>>>>> Currently, there are filter methods to omit all blobs, to omit
>>>>> all large blobs, and one to match a sparse-checkout
>>>>> specification.
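>>>>>
>>>>> For example, something like this (the exact syntax may differ in
>>>>> the WIP branch):
>>>>>
>>>>>     git clone --filter=blob:none <url>
>>>>>     git clone --filter=blob:limit=1m <url>
>>>>>     git clone --filter=sparse:oid=<blob-ish> <url>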
>>>>>
>>>>> Let me know if you have any questions or problems.
>>>>>
>>>>> Thanks,
>>>>> Jeff
>>
>>
>> [1,2]  [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing
>> blobs")
>> https://public-inbox.org/git/BC1048A63B034E46A11A01758BC04855@PhilipOakley/
>> Date: Tue, 25 Jul 2017 21:48:46 +0100
>> https://public-inbox.org/git/8EE0108BA72B42EA9494B571DDE2005D@PhilipOakley/
>> Date: Sat, 29 Jul 2017 13:51:16 +0100
>>
>> [3] [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
>> https://public-inbox.org/git/244AA0848E9D46F480E7CA407582A162@PhilipOakley/
>> Date: Sat, 29 Jul 2017 14:26:52 +0100
>>



