Re: [PATCH v5 35/40] Add Documentation/technical/external-odb.txt

On Fri, 25 Aug 2017 08:14:08 +0200
Christian Couder <christian.couder@xxxxxxxxx> wrote:

> As Git is used by more and more people with different needs, I
> think it is not realistic to expect that we can optimize its object
> storage for all these different needs. So a better strategy is to just
> let them store objects in external stores.
[snip]
> About these many use cases, I gave the "really big binary files"
> example which is why Git LFS exists (and which GitLab is interested in
> better solving), and the "really big number of files that are fetched
> only as needed" example which Microsoft is interested in solving. I
> could also imagine that some people have both big text files and big
> binary files in which case the "core.bigfilethreshold" might not work
> well, or that some people already have blobs in some different stores
> (like HTTP servers, Docker registries, artifact stores, ...) and want
> to fetch them from there as much as possible. 

Thanks for explaining the use cases - this makes sense, especially the
last one, which motivates the different modes for the "get" command
(return raw bytes vs populating the Git repository with loose/packed
objects).
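
To make sure we are talking about the same thing, here is how I picture
the two modes (a rough sketch in Python; "get" is from your series, but
the mode names and the helper invocation are my own placeholders):

  import subprocess

  def odb_get(helper, sha1, mode):
      # Ask the external helper for one object. "get" is the command
      # from the series under discussion; the mode names below are my
      # own placeholders, not actual protocol.
      #
      #   mode == "raw":      the helper writes the raw object bytes
      #                       to stdout and Git stores them itself
      #   mode == "populate": the helper writes loose/packed objects
      #                       directly into .git/objects, stdout empty
      result = subprocess.run([helper, "get", mode, sha1],
                              capture_output=True, check=True)
      return result.stdout if mode == "raw" else None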

> And then letting people
> use different stores can make clones or fetches restartable which
> would solve another problem people have long been complaining about...

This is unrelated to the rest of my e-mail, but out of curiosity, how
would a different store make clones or fetches restartable? Do you mean
that Git would invoke a "fetch" command through the ODB protocol instead
of using its own native protocol?

> >> +Furthermore many improvements that are dependent on specific setups
> >> +could be implemented in the way Git objects are managed if it was
> >> +possible to customize how the Git objects are handled. For example a
> >> +restartable clone using the bundle mechanism has often been requested,
> >> +but implementing that would go against the current strict rules under
> >> +which the Git objects are currently handled.
> >
> > So in this example, you would use today's git-clone to obtain a small version
> > of the repo and then obtain other objects later?
> 
> The problem with explaining how it would work is that the
> --initial-refspec option is added to git clone later in the patch
> series. And there could be changes in the later part of the patch
> series. So I don't want to promise or explain too much here.
> But maybe I could add another patch to better explain that at the end
> of the series.

Such an explanation, in whatever form (patch or e-mail), would be great,
because I'm not sure of the interaction between fetches and the
connectivity check.

The approach I have taken in my own patches [1] is to (1) declare that
if a lazy remote supplies an object, it promises to have everything
referred to by that object, and (2) we thus only need to check the
objects not from the lazy remote. Translated to the ODB world, (1) is
possible in the Microsoft case and is trivial in all the cases where the
ODB provides only blobs (since blobs don't refer to any other object),
and for (2), a "list" command should suffice.
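
To make (2) concrete, here is roughly how I picture the adjusted
connectivity check (a sketch; the "list" command is from your series,
assumed to return the set of object IDs the ODB can provide, and
everything else is illustrative):

  def connectivity_check(reachable, local, odb_listed):
      # reachable:  object IDs reachable from the refs being checked
      # local:      object IDs present in the local object store
      # odb_listed: object IDs the ODB advertises via "list"
      missing = []
      for oid in reachable:
          if oid in local:
              continue        # present locally
          if oid in odb_listed:
              continue        # promised by the ODB, per (1) above
          missing.append(oid)
      return missing          # non-empty means connectivity is broken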

One constraint is that we do not want to obtain (from the remote) or
store a separate list of what it has, because of the overhead involved.
(I saw the --initial-refspec approach - it would not satisfy this
constraint.)

For fetches, we remember the objects obtained from that specific remote
by adding a special file, name to be determined (I used ".imported" in
[1]). (The same method is used to note objects lazily downloaded.) The
repack command understands the difference between imported and
non-imported objects (patches for this are in progress).
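
In concrete terms, something like this (".imported" is the name used in
[1]; the helper functions themselves are just illustrative):

  import os

  def mark_imported(pack_path):
      # e.g. pack_path = ".git/objects/pack/pack-1234.pack" gets a
      # sibling marker ".git/objects/pack/pack-1234.imported"
      base, _ = os.path.splitext(pack_path)
      open(base + ".imported", "w").close()

  def is_imported(pack_path):
      base, _ = os.path.splitext(pack_path)
      return os.path.exists(base + ".imported")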

I'm not sure if this can be translated to the ODB world. The ODB can
declare a special capability that fetch sends to the server in order to
inform the server that it can exclude certain objects, and fetch can
inform the ODB of the packfiles that it has written, but I'm not sure
how the ODB can "remember" what it has. The ODB could mark such packs
with ".managed" to note that they are managed by that ODB, so Git
shouldn't touch them, but this means (for example) that Git can't GC
them (and it also seems quite contradictory for an ODB to manage Git
packfiles).
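
On the gc/repack side, the ".managed" idea would look roughly like this
(a sketch; the marker name is from the paragraph above, the mechanism
itself is hypothetical):

  import glob
  import os

  def packs_git_may_repack(objects_dir):
      # Packs with a sibling ".managed" marker belong to the ODB and
      # are skipped, which is exactly what makes them invisible to GC.
      packs = glob.glob(os.path.join(objects_dir, "pack", "*.pack"))
      return [p for p in packs
              if not os.path.exists(os.path.splitext(p)[0] + ".managed")]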

[1] https://public-inbox.org/git/20170804145113.5ceafafa@xxxxxxxxxxxxxxxxxxxxxxxxxxx/


