Lars Schneider <larsxschneider@xxxxxxxxx> writes:

> Some applications have test data, image assets, and other data sets that
> need to be versioned along with the source code.
>
> How would you deal with these kind of "huge objects" _today_?

When you know that you'd find the answer to that question totally
uninteresting, why do you even bother to ask? ;-)  I don't, and if I
had to, I would deal with them just like any other objects.

A more interesting pair of questions to ask would be what the
fundamental requirement for an acceptable solution is, and what
solution within that constraint I would envision, if I were given a
group of competent Git hackers and enough time to realize it.

The most important constraint is that any acceptable solution should
preserve the object identity.  Starting from an "I don't, but if I
had to..." repository that is created in a dumb way, a solution that
satisfies the constraint may work like this, requiring enhancements
to various parts of the system:

 - The "upload-pack" protocol would allow the owner of such a
   repository and the party that "git clone"'s from there to
   negotiate:

   . what it means for an object to be "huge" (e.g. the owner may
     implicitly show the preference by marking a packfile as
     containing such "huge" objects, may configure that blobs that
     appear at paths matching certain glob patterns are "huge", or
     the sender and the receiver may agree that objects larger than
     X MB are "huge", etc.); and

   . what to do with "huge" objects (e.g. the receiver may ask for a
     full clone, or the receiver may ask to omit "huge" ones from
     the initial transfer).

 - The "upload-pack" protocol would give, in addition to the normal
   pack stream that conveys only non-"huge" objects, for each "huge"
   object that is not transferred, what its object name is and how
   it can later be retrieved.

 - Just like packing objects in packfiles was added as a different
   implementation to store objects in the object database that is
   better than storing them individually as loose object files,
   there will be a third way to store such "huge" objects _in_ the
   object database, which may actually not _store_ them locally at
   all.  The local object store may merely have placeholders for
   them, in which instructions for how they can be acquired when
   necessary are stored.  The extra information sent over the
   "upload-pack" protocol for "huge" objects in the previous bullet
   point is used to store these objects in this "third" way.

 - A new mechanism would allow objects that are stored in this
   "third" way to be retrieved lazily or on-demand.

There are other enhancements whose necessity will fall naturally out
of the lazy scheme outlined above.  E.g. "fsck" needs to learn that
objects stored in the third way are considered to "exist" but their
actual contents are not expected to be verifiable until they are
retrieved.  "send-pack" (i.e. running "git push" from a repository
cloned with the procedure outlined above) needs to treat objects
stored in the third way differently (most likely, it will fail a
request for a full clone and send "not here, but you can get it this
way" for them).  Local operations that need more than object names
need to learn reasonable fallback behaviours to work when the actual
object contents are not yet available (e.g. all of them may offer
"this is not yet available; do you want to get it on-demand?", or
there may even be an "object.ondemand" configuration option to skip
the end-user interaction.  When on-demand retrieval is not done,
"git archive" may place a placeholder file in its output that says
"no data (yet) here", "git log --raw" may show the object name but
"git log -p" may say "insufficient data to produce a patch", etc.)
[*1*].
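To make the outline above a little more concrete, the owner-side
preference and the clone-time negotiation might end up looking
something like the sketch below.  None of these configuration keys
or command line options exist today; they are purely illustrative
placeholders for whatever the negotiation ends up being:

    # In the serving repository: a hypothetical way for the owner to
    # declare what "huge" means there, by path pattern and/or size.
    [hugeObject]
            pathPattern = assets/**
            sizeLimit = 50m

    # On the cloning side: a hypothetical option asking the server
    # to omit "huge" objects from the initial transfer and to send
    # only their names plus retrieval instructions instead.
    $ git clone --omit-huge git://example.com/project.git

For each object omitted this way, the placeholder kept in the local
object store (the "third" way above) would record at least the
object name and a hint of where the contents can later be fetched
from.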
Because we start from "the object identity should not change", you
do not have to make a decision upfront when preparing the ultimate
source of truth.  When you take a clone-network of a single project
as a whole, somebody needs to hold the entire set of objects
somewhere, and many of the repositories in the clone-network may
have "huge" objects in the third, "not here yet, here is how to get
it" form.  As the system improves, and as networking and storage
technology changes, the definition of "huge" WILL change over time,
and those repositories can turn objects that used to be "huge" into
normal ones.

If you use the approaches taken by the current crop of clean/smudge
based solutions [*2*], on the other hand, once you decide a blob
object is "huge" and needs to be replaced with a surrogate (to be
instantiated via the "clean" filter), the "huge" object _has_ to
stay in the surrogate form in the containing tree, and you cannot
ever change the division between "huge" and "normal" without
rewriting the history.

[Footnote]

*1* Astute readers would realize that the utility of such a "third
    way" object storage mechanism is not limited to "keep and
    transfer huge objects lazily".  The same mechanism can say "not
    yet here, and there is no way for _you_ to retrieve the
    contents", which is an effective way to "obliterate" an object.

*2* I called them "hacks" because they are practical compromises
    that can be made with today's Git, while sidestepping the harder
    problems that need to be solved to realize the solution outlined
    above.
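For concreteness, the clean/smudge arrangement that *2* refers to is
built on the existing filter attribute machinery, roughly along the
following lines.  The "huge" filter name and the "huge-store" helper
command are made up for illustration; the point is only to show
where the surrogate comes from:

    # .gitattributes: route matching paths through a filter.
    assets/**       filter=huge

    # .git/config: the "clean" side replaces the real content with a
    # small surrogate ("pointer") blob when the path is added, and
    # the "smudge" side turns the surrogate back into the real
    # content at checkout time ("%f" is the path being filtered).
    [filter "huge"]
            clean = huge-store store %f
            smudge = huge-store fetch %f

Because it is the surrogate blob, not the real content, that gets
recorded in the tree objects, the division between "huge" and
"normal" is frozen into the resulting history, which is exactly the
limitation described above.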