Hey,

On Thu, Mar 19, 2009 at 3:31 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Scott Chacon <schacon@xxxxxxxxx> writes:
>
>> But where Git instead stores a stub object and the large binary object
>> is pulled in via a separate mechanism. I was thinking that the client
>> could set a max file size and when a binary object larger than that is
>> staged, Git instead writes a stub blob like:
>>
>> ==
>> blob [size]\0
>> [sha of large blob]
>> ==
>
> An immediate pair of questions are, if you can solve the issue by
> delegating large media to somebody else (i.e. "media server"), and that
> somebody else can solve the issues you are having, (1) what happens if you
> lower that "large" threshold to "0 byte"? Does that somebody else still
> work fine, and does the git that uses indirection also still work fine?
> If so why are you using git instead of that somebody else altogether? and

In theory it would work fine: all the commits and trees would be
transferred over git and all the blobs would be stored elsewhere, but I
assume it would be much slower for the end user, so nobody would do that.
I would imagine users would only enable this if they have large media
files that they don't want every version of cloned every time. I can't
imagine this being used by more than a small percentage of users, but
when large media does need to live alongside source code, people either
won't use Git at all (they'll use Perforce or SVN), or they'll put it in
anyway and then kill their (or our) servers when upload-pack tries to
mmap it (twice, yes?). I thought it would be much more efficient if Git
could simply mark files that don't make sense to pack, keep track of
them, and transfer them via a more appropriate protocol.

> (2) what prevents us from stealing the trick that somebody else uses so
> that git itself can natively handle large blobs without indirection?

Actually, I'm fine with that. Phase two of this project, if it made sense
at all, would be another set of git transfer commands that let large
blobs be uploaded and downloaded separately - importantly, never putting
them in a packfile, and keeping them loose, uncompressed and headerless
on disk so they can simply be streamed when requested. I am thinking
entirely of movies and images that are already compressed, where there is
simply no need to load them into memory at all. I just thought that
taking advantage of services that already do this (scp, sftp, s3) would
be quicker than building another set of transfer protocols into Git.

> Without thinking the ramifications through myself, this sounds pretty much
> like a band-aid and will end up hitting the same "blob is larger than we
> can handle" issue when you follow the indirection eventually, but that is
> just my gut feeling.

The point is that we don't keep this data as 'blob's at all - we don't
try to compress it or add the object header, because it's too big and
already compressed; doing so is a waste of time and often outside the
memory tolerance of many systems. We keep only the stub in our db and
stream the large media content directly to and from disk. If we do a
'checkout' or something else that would switch it out, we could store the
data in '.git/media' or the equivalent until it's uploaded elsewhere.
(There's a rough sketch of what the client side might look like at the
bottom of this mail.)

> This is an off-topic "By the way", but has the other topic addressed to
> you on git-scm.com/about been resolved in any way yet?

Thanks for pointing that out, I missed that thread.
I actually just pushed out some changes over the last few days - I added
the Gnome project since they just announced they're moving to Git, added
a link to the new O'Reilly book that was just released, and pulled in
some validation and other misc changes that had been contributed.
Currently I have to re-gen the Authors data manually, so I do it every
once in a while - I just pushed up new data. Doing it per release is a
good idea; I'll try to get that into the release script.
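
For what it's worth, here's a rough sketch (in Python, purely for
illustration) of what the client-side "stage or stub" decision could look
like. Nothing here exists in git today; the threshold, the '.git/media'
directory name and the stub layout are all assumptions based on the idea
described above, not a real design:

==
# Purely illustrative, untested sketch -- MAX_SIZE, MEDIA_DIR and the
# stub layout are assumptions, not anything git actually does.
import hashlib
import os
import shutil

MAX_SIZE = 50 * 1024 * 1024                  # hypothetical "large" threshold
MEDIA_DIR = os.path.join(".git", "media")    # hypothetical local media cache

def content_to_stage(path):
    """Return the bytes git would store as the blob for `path`.

    Small files come back unchanged; large files are copied into
    .git/media/<sha> and replaced by a small stub recording the sha
    (and size) of the real content.
    """
    size = os.path.getsize(path)
    if size <= MAX_SIZE:
        with open(path, "rb") as f:
            return f.read()

    # Hash the real content in a streaming fashion so the whole file
    # is never held in memory.
    sha = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            sha.update(chunk)
    digest = sha.hexdigest()

    # Park the real content locally until it can be pushed to the
    # media server (scp, sftp, s3, whatever ends up being used).
    if not os.path.isdir(MEDIA_DIR):
        os.makedirs(MEDIA_DIR)
    shutil.copyfile(path, os.path.join(MEDIA_DIR, digest))

    # Only this small stub goes into the object database.
    return ("stub\n%s\n%d\n" % (digest, size)).encode()
==

The transfer side would then presumably just ship whatever is sitting in
'.git/media' over scp/sftp/s3 and fetch missing entries by sha when a
checkout needs them.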