I think it would be great if we agree at a high level on the desired user
experience, so let me put a few possible use cases here.

1. Init and fetch into a new repo with a sparse list.
Preconditions: origin blah exists and has a lot of folders inside of
src including "bar".
Actions:
    git init foo && cd foo
    # New flag to activate all sparse operations by default so you don't
    # need to pass options to each command.
    git config core.sparseAll true
    echo "src/bar" > .git/info/sparse-checkout
    git remote add origin blah
    git pull origin master
Expected results: foo contains src/bar folder and nothing else, objects
that are unrelated to this tree are not fetched.
Notes: This should work the same way when fetch/merge/checkout operations
are used in the right order.

2. Add a file and push changes.
Preconditions: all steps above followed.
Actions:
    touch src/bar/baz.txt && git add -A && git commit -m "added a file"
    git push origin master
Expected results: changes are pushed to remote.

3. Clone a repo with a sparse list as a filter.
Preconditions: same as for #1
Actions:
    echo "src/bar" > /tmp/blah-sparse-checkout
    # Clone should be the only command that requires a specific option
    # key to be passed.
    git clone --sparse /tmp/blah-sparse-checkout blah
Expected results: same as for #1 plus /tmp/blah-sparse-checkout is copied
into .git/info/sparse-checkout

4. Showing log for sparsely cloned repo.
Preconditions: #3 is followed
Actions:
    git log
Expected results: recent changes that affect src/bar tree.

5. Showing diff.
Preconditions: #3 is followed
Actions:
    git diff HEAD^ HEAD
Expected results: changes from the most recent commit affecting src/bar
folder are shown.
Notes: this can be a tricky operation as filtering must be done to remove
results from unrelated subtrees.

*Note that I intentionally didn't mention use cases that are related to
filtering by blob size as I think we should logically consider them as
a separate, although related, feature.

What do you think about these examples above?
Is that something that more-or-less fits into current development?
Are there other important flows that I've missed?

-Vitaly

On Thu, Nov 30, 2017 at 5:27 PM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
> Jonathan, thanks for references, that is super helpful, I will follow
> your suggestions.
>
> Philip, I agree that keeping original DVCS off-line capability is an
> important point. Ideally this feature should work even with remotes
> that are located on the local disk.
> Which part of Jeff's work do you think wouldn't work offline after
> repo initialization is done and sparse fetch is performed? All the
> stuff that I've seen seems to be quite usable without GVFS.
> I'm not sure if we need to store markers/tombstones on the client,
> what problem does it solve?
>
> On Thu, Nov 30, 2017 at 3:43 PM, Philip Oakley <philipoakley@xxxxxxx> wrote:
>> From: "Vitaly Arbuzov" <vit@xxxxxxxx>
>>>
>>> Found some details here: https://github.com/jeffhostetler/git/pull/3
>>>
>>> Looking at commits I see that you've done a lot of work already,
>>> including packing, filtering, fetching, cloning etc.
>>> What are some areas that aren't complete yet? Do you need any help
>>> with implementation?
>>>
>>
>> comments below..
>>
>>>
>>> On Thu, Nov 30, 2017 at 9:01 AM, Vitaly Arbuzov <vit@xxxxxxxx> wrote:
>>>>
>>>> Hey Jeff,
>>>>
>>>> It's great, I didn't expect that anyone is actively working on this.
>>>> I'll check out your branch, meanwhile do you have any design docs that
>>>> describe these changes or can you define high level goals that you
>>>> want to achieve?
>>>>
>>>> On Thu, Nov 30, 2017 at 6:24 AM, Jeff Hostetler <git@xxxxxxxxxxxxxxxxx>
>>>> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 11/29/2017 10:16 PM, Vitaly Arbuzov wrote:
>>>>>>
>>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> I'm looking for ways to improve fetch/pull/clone time for large git
>>>>>> (mono)repositories with unrelated source trees (that span across
>>>>>> multiple services).
>>>>>> I've found sparse checkout approach appealing and helpful for most of
>>>>>> client-side operations (e.g. status, reset, commit, etc.)
>>>>>> The problem is that there is no feature like sparse fetch/pull in git,
>>>>>> this means that ALL objects in unrelated trees are always fetched.
>>>>>> It may take a lot of time for large repositories and results in some
>>>>>> practical scalability limits for git.
>>>>>> This forced some large companies like Facebook and Google to move to
>>>>>> Mercurial as they were unable to improve client-side experience with
>>>>>> git while Microsoft has developed GVFS, which seems to be a step back
>>>>>> to CVCS world.
>>>>>>
>>>>>> I want to get a feedback (from more experienced git users than I am)
>>>>>> on what it would take to implement sparse fetching/pulling.
>>>>>> (Downloading only objects related to the sparse-checkout list)
>>>>>> Are there any issues with missing hashes?
>>>>>> Are there any fundamental problems why it can't be done?
>>>>>> Can we get away with only client-side changes or would it require
>>>>>> special features on the server side?
>>>>>>
>>
>> I have, for separate reasons been _thinking_ about the issue ($dayjob is in
>> defence, so a similar partition would be useful).
>>
>> The changes would almost certainly need to be server side (as well as client
>> side), as it is the server that decides what is sent over the wire in the
>> pack files, which would need to be a 'narrow' pack file.
>>
>>>>>> If we had such a feature then all we would need on top is a separate
>>>>>> tool that builds the right "sparse" scope for the workspace based on
>>>>>> paths that developer wants to work on.
>>>>>>
>>>>>> In the world where more and more companies are moving towards large
>>>>>> monorepos this improvement would provide a good way of scaling git to
>>>>>> meet this demand.
>>
>>
>> The 'companies' problem is that it tends to force a client-server, always-on
>> on-line mentality.
>> I'm also wanting the original DVCS off-line capability to
>> still be available, with _user_ control, in a generic sense, of what they
>> have locally available (including files/directories they have not yet looked
>> at, but expect to have). IIUC Jeff's work is that on-line view, without the
>> off-line capability.
>>
>> I'd commented early in the series at [1,2,3].
>>
>>
>> At its core, my idea was to use the object store to hold markers for the
>> 'not yet fetched' objects (mainly trees and blobs). These would be in a
>> known fixed format, and have the same effect (conceptually) as the
>> sub-module markers - they _confirm_ the oid, yet say 'not here, try
>> elsewhere'.
>>
>> The comparison with submodules means there is the same chance of
>> de-synchronisation with triangular and upstream servers, unless managed.
>>
>> The server side, as noted, will need to be included as it is the one that
>> decides the pack file.
>>
>> Options for a server management are:
>>
>> - "I accept narrow packs?" No; yes
>>
>> - "I serve narrow packs?" No; yes.
>>
>> - "Repo completeness checks on receipt": (must be complete) || (allow narrow
>> to nothing).
>>
>> For server farms (e.g. GitHub..) the settings could be global, or by repo.
>> (note that the completeness requirement and narrow receipt option are not
>> incompatible - the recipient server can reject the pack from a narrow
>> subordinate as incomplete - see below)
>>
>> * Marking of 'missing' objects in the local object store, and on the wire.
>> The missing objects are replaced by a place holder object, which uses the
>> same oid/sha1, but has a short fixed length, with content “GitNarrowObject
>> <oid>”. The chance that that string would actually have such an oid clash is
>> the same as all other object hashes, so is a *safe* self-referential device.
>>
>>
>> * The stored object already includes length (and inferred type), so we do
>> know what it stands in for.
>> Thus the local index (index file) should be able
>> to be recreated from the object store alone (including the ‘promised /
>> narrow / missing’ files/directory markers)
>>
>> * the ‘same’ as sub-modules.
>> The potential for loss of synchronisation with a golden complete repo is
>> just the same as for sub-modules. (We expected object/commit X here, but
>> it’s not in the store). This could happen with a small user group who have
>> locally narrow clones, who interact with their local narrow server for
>> ‘backup’, and then fail to push further upstream to a server that mandates
>> completeness. They could create a death by a thousand narrow cuts. Having a
>> golden upstream config reference (indicating which is the upstream) could
>> allow checks to ensure that doesn’t happen.
>>
>> The fsck can be taught the config option of 'allowNarrow'.
>>
>> The narrowness would be defined in a locally stored '.gitNarrowIgnore' file
>> (which can include the size constraints being developed elsewhere on the
>> list)
>>
>> As a safety it could be that the .gitNarrowIgnore is sent with the pack so
>> that folk know what they missed, and fsck could check that they are locally
>> not narrower than some specific project .gitNarrowIgnore spec.
>>
>> The benefit of this is that the off-line operation capability of Git continues,
>> which GVFS doesn’t quite do (accidental lock in to a client-server model aka
>> all those other VCS systems).
>>
>> I believe that it's all doable, and that Jeff H's work already puts much of
>> it in place, or touches those places
>>
>> That said, it has been just _thinking_, without sufficient time to delve
>> into the code.
>>
>> Phil
>>
>>>>>>
>>>>>> PS. Please don't advise to split things up, as there are some good
>>>>>> reasons why many companies decide to keep their code in the monorepo,
>>>>>> which you can easily find online. So let's keep that part out of the
>>>>>> scope.
>>>>>>
>>>>>> -Vitaly
>>>>>>
>>>>>
>>>>>
>>>>> This work is in-progress now.
>>>>> A short summary can be found in [1]
>>>>> of the current parts 1, 2, and 3.
>>>>>
>>>>>> * jh/object-filtering (2017-11-22) 6 commits
>>>>>> * jh/fsck-promisors (2017-11-22) 10 commits
>>>>>> * jh/partial-clone (2017-11-22) 14 commits
>>>>>
>>>>>
>>>>>
>>>>> [1]
>>>>>
>>>>> https://public-inbox.org/git/xmqq1skh6fyz.fsf@xxxxxxxxxxxxxxxxxxxxxxxxxxx/T/
>>>>>
>>>>> I have a branch that contains V5 all 3 parts:
>>>>> https://github.com/jeffhostetler/git/tree/core/pc5_p3
>>>>>
>>>>> This is a WIP, so there are some rough edges....
>>>>> I hope to have a V6 out before the weekend with some
>>>>> bug fixes and cleanup.
>>>>>
>>>>> Please give it a try and see if it fits your needs.
>>>>> Currently, there are filter methods to filter all blobs,
>>>>> all large blobs, and one to match a sparse-checkout
>>>>> specification.
>>>>>
>>>>> Let me know if you have any questions or problems.
>>>>>
>>>>> Thanks,
>>>>> Jeff
>>
>>
>> [1,2] [RFC PATCH 0/3] Partial clone: promised blobs (formerly "missing
>> blobs")
>> https://public-inbox.org/git/BC1048A63B034E46A11A01758BC04855@PhilipOakley/
>> Date: Tue, 25 Jul 2017 21:48:46 +0100
>> https://public-inbox.org/git/8EE0108BA72B42EA9494B571DDE2005D@PhilipOakley/
>> Date: Sat, 29 Jul 2017 13:51:16 +0100
>>
>> [3] [RFC PATCH v2 2/4] promised-object, fsck: introduce promised objects
>> https://public-inbox.org/git/244AA0848E9D46F480E7CA407582A162@PhilipOakley/
>> Date: Sat, 29 Jul 2017 14:26:52 +0100
>>
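
PS. To make the marker idea quoted above a bit more concrete, here is a
rough sketch of how I read it: a 'not yet fetched' object stays in the
store under its real oid, but its content is the short marker
"GitNarrowObject <oid>". This is only an illustration under my own
assumptions. It uses an in-memory dict instead of Git's object database,
the names NarrowStore, put, put_placeholder, is_placeholder and get are
made up for the example, and it says nothing about how Jeff's branch
handles promised objects. The one part that follows Git itself is the oid
computation (SHA-1 over "<type> <size>\0" followed by the content).

```python
import hashlib

# Marker text taken from the proposal quoted above.
MARKER_PREFIX = b"GitNarrowObject "


def git_oid(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object id Git assigns to this type and content."""
    header = f"{obj_type} {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


class NarrowStore:
    """Hypothetical in-memory store that tolerates 'not yet fetched' objects."""

    def __init__(self):
        self._objects = {}  # oid -> raw bytes (real content or marker)

    def put(self, obj_type: str, content: bytes) -> str:
        """Store a real object and return its oid."""
        oid = git_oid(obj_type, content)
        self._objects[oid] = content
        return oid

    def put_placeholder(self, oid: str) -> None:
        """Record that the object with this oid exists but is not held locally."""
        self._objects[oid] = MARKER_PREFIX + oid.encode()

    def is_placeholder(self, oid: str) -> bool:
        """True if the entry under this oid is the self-referential marker."""
        return self._objects[oid] == MARKER_PREFIX + oid.encode()

    def get(self, oid: str) -> bytes:
        """Return real content, or fail with 'not here, try elsewhere'."""
        if self.is_placeholder(oid):
            raise LookupError(f"{oid} is not here, try elsewhere")
        return self._objects[oid]


if __name__ == "__main__":
    store = NarrowStore()
    present = store.put("blob", b"hello\n")
    missing = git_oid("blob", b"content we never fetched\n")
    store.put_placeholder(missing)
    print(present, store.is_placeholder(present))
    print(missing, store.is_placeholder(missing))
```

The marker embeds the oid it stands in for, so a real object whose content
happened to equal the marker would need to hash to that same oid, which is
the collision argument made in the quoted text. A real implementation
would also have to carry the type and length of the missing object, which
this sketch leaves out.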