On Fri, 25 Aug 2017 08:14:08 +0200
Christian Couder <christian.couder@xxxxxxxxx> wrote:

> As Git is used by more and more by people having different needs, I
> think it is not realistic to expect that we can optimize its object
> storage for all these different needs. So a better strategy is to just
> let them store objects in external stores.

[snip]

> About these many use cases, I gave the "really big binary files"
> example which is why Git LFS exists (and which GitLab is interested in
> better solving), and the "really big number of files that are fetched
> only as needed" example which Microsoft is interested in solving. I
> could also imagine that some people have both big text files and big
> binary files in which case the "core.bigfilethreshold" might not work
> well, or that some people already have blobs in some different stores
> (like HTTP servers, Docker registries, artifact stores, ...) and want
> to fetch them from there as much as possible.

Thanks for explaining the use cases - this makes sense, especially the
last one which motivates the different modes for the "get" command
(return raw bytes vs populating the Git repository with loose/packed
objects).

> And then letting people
> use different stores can make clones or fetches restartable which
> would solve another problem people have long been complaining about...

This is unrelated to the rest of my e-mail, but out of curiosity, how
would a different store make clones or fetches restartable? Do you mean
that Git would invoke a "fetch" command through the ODB protocol instead
of using its own native protocol?

> >> +Furthermore many improvements that are dependent on specific setups
> >> +could be implemented in the way Git objects are managed if it was
> >> +possible to customize how the Git objects are handled.
> >> +For example a restartable clone using the bundle mechanism has often
> >> +been requested, but implementing that would go against the current
> >> +strict rules under which the Git objects are currently handled.
> >
> > So in this example, you would use todays git-clone to obtain a small version
> > of the repo and then obtain other objects later?
>
> The problem with explaining how it would work is that the
> --initial-refspec option is added to git clone later in the patch
> series. And there could be changes in the later part of the patch
> series. So I don't want to promise or explain too much here.
> But maybe I could add another patch to better explain that at the end
> of the series.

Such an explanation, in whatever form (patch or e-mail) would be great,
because I'm not sure of the interaction between fetches and the
connectivity check.

The approach I have taken in my own patches [1] is to (1) declare that
if a lazy remote supplies an object, it promises to have everything
referred to by that object, and (2) we thus only need to check the
objects not from the lazy remote. Translated to the ODB world, (1) is
possible in the Microsoft case and is trivial in all the cases where the
ODB provides only blobs (since blobs don't refer to any other object),
and for (2), a "list" command should suffice.

One constraint is that we do not want to obtain (from the remote) or
store a separate list of what it has, to avoid the overhead. (I saw the
--initial-refspec approach - that would not work if we want to avoid the
overhead.) For fetches, we remember the objects obtained from that
specific remote by adding a special file, name to be determined (I used
".imported" in [1]). (The same method is used to note objects lazily
downloaded.) The repack command understands the difference between these
two types of objects (patches for this are in progress). I'm not sure if
this can be translated to the ODB world.
The ODB can declare a special capability that fetch sends to the server
in order to inform the server that it can exclude certain objects, and
fetch can inform the ODB of the packfiles that it has written, but I'm
not sure how the ODB can "remember" what it has. The ODB could mark such
packs with ".managed" to note that they are managed by that ODB, so Git
shouldn't touch them, but this means (for example) that Git can't GC
them (and it also seems quite contradictory for an ODB to manage Git
packfiles).

[1] https://public-inbox.org/git/20170804145113.5ceafafa@xxxxxxxxxxxxxxxxxxxxxxxxxxx/
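
P.S. To make points (1) and (2) above a bit more concrete, here is a
rough Python sketch of the kind of connectivity check I have in mind.
It is only an illustration under my own assumptions, not code from Git
or from the patches in [1]: the names check_connectivity, local and
promised are made up, objects are modelled as a plain dict from object
ID to the IDs it refers to, and "promised" stands in for whatever an ODB
"list" command would return.

```python
from typing import Dict, Iterable, List, Set


def check_connectivity(
    tips: Iterable[str],
    local: Dict[str, List[str]],
    promised: Set[str],
) -> List[str]:
    """Return IDs that are reachable from tips but are neither
    available locally nor promised by the lazy remote / ODB.

    local    -- hypothetical model of the local object store:
                object ID -> IDs of the objects it refers to
    promised -- hypothetical set of IDs the lazy remote / ODB has
    """
    missing: List[str] = []
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid in promised:
            # (1): the remote promises everything this object refers
            # to, so the walk does not descend any further here.
            continue
        if oid not in local:
            # (2): not promised and not local, so genuinely missing.
            missing.append(oid)
            continue
        stack.extend(local[oid])
    return sorted(missing)


if __name__ == "__main__":
    # c2 is a local commit on top of c1, which came from the lazy remote.
    local = {"c2": ["c1", "t2"], "t2": ["b2"]}
    print(check_connectivity(["c2"], local, promised={"c1"}))
    print(check_connectivity(["c2"], local, promised={"c1", "b2"}))
```

In the first call the blob b2 is reported because nobody vouches for it;
in the second call the remote promises b2, so nothing is reported. Note
that c1's own tree is never looked at in either case, which is the whole
point of (1).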