On Thu, Oct 04, 2012 at 01:41:40PM -0700, Junio C Hamano wrote:

> Jeff King <peff@xxxxxxxx> writes:
>
> > [1] One thing I've been toying with is "external alternates": dumping
> >     your large objects in some relatively slow data store (e.g., a
> >     RESTful HTTP service). You could cache and cheaply query a list of
> >     "sha1 / size / type" for each object from the store, but getting
> >     the actual objects would be much more expensive. But again, it
> >     would depend on whether you would actually have such a store
> >     directly accessible by a ref.
>
> Yeah, that actually has been another thing we were discussing
> locally, without coming to something concrete enough to present to
> the list.
>
> The basic idea is to mark such paths with attributes, and use a
> variant of smudge/clean filter that is _not_ a filter (as we do not
> want to have the interface to this external helper to be "we feed
> the whole big blob to you"). Instead, these smudgex/cleanx things
> work on a pathname.

There are a few ways in which smudge/clean filters are not up to the
task currently. One is definitely the efficiency of the interface.
Another is that the distinction here is not really "canonical in-repo
representation versus working-tree representation". Yes, part of it is
about what you store in the repo versus what you check out. But you may
also want to diff or merge these large items. Almost certainly not with
the internal text tools, but you might want to feed the _real_ blobs
(not the fake surrogates) to external tools, and smudge and clean are
not a natural fit for that. If you teach the external tools to
dereference the surrogates, it would work. I think you could also get
around it by giving these smudgex/cleanx filters slightly different
semantics.

But the thing I most dislike about a solution like this is that the
surrogate representation pollutes git's data model. That is, git stores
a blob representing the surrogate, and that is the official sha1 that
goes into the tree. That means your surrogate decisions are locked into
history forever, which has a few implications:

  1. If I store a surrogate and you store the real blob, we do not get
     the same tree (and in fact get conflicts).

  2. Once I store a real blob, I can never revise that decision; every
     subsequent clone will have to carry it unless I rewrite history. I
     can convert it into a surrogate going forward, but the old versions
     of that blob will always be required for object connectivity.

  3. If your surrogate contains more than just the sha1 (e.g., if it
     points to http://cdn.example.com), your history is forever tied to
     that (and you are stuck with configuring run-time redirects if your
     URL changes).

My thinking was to make it all happen below the git object level, at
the same level as packs and loose objects. This is more invasive to
git, but much more flexible.

The basic implementation is pretty straightforward. You configure one
or more helpers which have two operations: provide a list of sha1s, and
fetch a single sha1. Whenever an object lookup fails in the packs and
loose objects, we check the helper lists. We can cache the lists in
some mmap'able format similar to a pack index, so we can do quick
checks for object existence, type, and size. And if the lookup actually
needs the object contents, we fetch and cache them (where the caching
policy would all be determined by the external script).
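To make that shape concrete, here is a rough sketch in Python of what I
mean. None of this exists; the helper sub-commands, the config they
would come from, and the cache layout are all invented for
illustration:

    import os
    import subprocess

    # Imagined helper contract (not an existing git interface):
    #
    #   <helper> list        ->  one "<sha1> <type> <size>" line per object
    #   <helper> get <sha1>  ->  raw object contents on stdout

    class ExternalOdb:
        def __init__(self, helper_cmd, cache_dir):
            self.helper_cmd = helper_cmd   # e.g. ["my-odb-helper"]
            self.cache_dir = cache_dir
            self._index = None             # sha1 -> (type, size)

        def _load_index(self):
            # In git this would be an mmap'able file along the lines of
            # a pack .idx; a dict stands in for it here.
            if self._index is None:
                out = subprocess.check_output(self.helper_cmd + ["list"])
                self._index = {}
                for line in out.decode().splitlines():
                    sha1, otype, size = line.split()
                    self._index[sha1] = (otype, int(size))
            return self._index

        def info(self, sha1):
            # Cheap existence/type/size query; never touches the slow store.
            return self._load_index().get(sha1)

        def fetch(self, sha1):
            # Only called when somebody really needs the contents; the
            # result is cached locally so the network cost is paid once.
            path = os.path.join(self.cache_dir, sha1)
            if not os.path.exists(path):
                data = subprocess.check_output(self.helper_cmd + ["get", sha1])
                os.makedirs(self.cache_dir, exist_ok=True)
                with open(path, "wb") as f:
                    f.write(data)
            with open(path, "rb") as f:
                return f.read()

    def read_object(sha1, local_lookup, external_odbs):
        # Normal lookup first (packs and loose objects), then fall back
        # to the configured helpers.
        obj = local_lookup(sha1)
        if obj is not None:
            return obj
        for odb in external_odbs:
            if odb.info(sha1) is not None:
                return odb.fetch(sha1)
        raise KeyError("object %s not found" % sha1)

The point of caching the list is that existence/type/size queries stay
cheap; only reads that actually need the contents have to talk to the
slow store.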
The real challenges are:

  1. It is expensive to meet normal reachability guarantees. So you
     would not want a remote to send you a delta against a blob that
     you _could_ get from the external store, but do not actually have
     locally.

  2. You need to tell remotes that you can access some objects in a
     different way, and not to send them as part of an object transfer.

  3. Many commands need to be taught not to load objects carelessly.
     For the most part, we do well with this because it's already
     expensive to load large objects from disk. I think we've got most
     of the problem spots on the diff code path, but I wouldn't be
     surprised if one or two more crop up. Fsck would also need to
     learn to handle these objects differently.

Item (3) is really just about trying it and seeing where the problems
come up. For items (1) and (2), we'd need a protocol extension. Having
the receiver send something like "I have these blobs from my external
source, don't send them" is a nice idea, but it doesn't scale well (you
have to advertise the whole list for each transfer, because the
receiver doesn't know which of them will actually be referenced).
Having the receiver say "I have access to external database $FOO, go
check its list of objects and don't send anything it has" unnecessarily
ties the sender to the external database (i.e., the sender has to
implement $FOO for it to work).

Something simple like "do not send blobs larger than N bytes, nor make
deltas against such blobs" would work. It does mean that your value of
N really needs to match up with what goes into your external database,
but I don't think that will be a huge problem in practice (after the
transfer you can do a consistency check that, between the objects you
just fetched and what the external db lists, you have everything; see
the sketch in the P.S. below).

> [details on smudgex/cleanx]

All of what you wrote seems very sane; I think the real question is
whether this should be "above" or "below" git's user-visible data model
layer.

-Peff
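P.S. Here is a rough sketch, in the same vein as above, of what that
post-fetch consistency check could look like. The cutoff value, the
shape of the external index, and the function itself are all invented
for illustration; nothing like this exists in git today.

    def check_external_coverage(reachable_blobs, have_locally,
                                external_index, cutoff):
        # After a fetch under a hypothetical "no blobs larger than N
        # bytes" extension, every reachable blob must be either in the
        # local object store or known to the external database.
        missing = []
        for sha1 in reachable_blobs:
            if sha1 in have_locally:
                continue
            info = external_index.get(sha1)    # (type, size) or None
            if info is None:
                # Neither we nor the external db have it: that is real
                # breakage, not just a policy mismatch.
                missing.append(sha1)
            elif info[1] < cutoff:
                # The blob lives only in the external db even though it
                # is under our cutoff; our N and the db's contents do
                # not quite agree, which is worth flagging.
                print("warning: %s (%d bytes) only in external db"
                      % (sha1, info[1]))
        return missing

The idea is just to confirm that, between what was fetched and what the
external db lists, nothing reachable is unaccounted for; a mismatch
between N and the db's contents then shows up as a warning rather than
a correctness problem.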