Hey,

On Thu, Mar 19, 2009 at 5:11 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> david@xxxxxxx writes:
>
>> On Thu, 19 Mar 2009, Junio C Hamano wrote:
>>
>>> Scott Chacon <schacon@xxxxxxxxx> writes:
>>>
>>>> The point is that we don't keep this data as 'blob's - we don't try
>>>> to compress them or add the header to them; they're too big and
>>>> already compressed, it's a waste of time, and it's often outside the
>>>> memory tolerance of many systems. We keep only the stub in our db
>>>> and stream the large media content directly to and from disk. If we
>>>> do a 'checkout' or something that would switch it out, we could
>>>> store the data in '.git/media' or the equivalent until it's uploaded
>>>> elsewhere.
>>>
>>> Aha, that sounds like you can just maintain a set of out-of-tree
>>> symbolic links that you keep track of, and let other people
>>> (e.g. rsync) deal with the complexity of managing that side of the
>>> world.
>>>
>>> And I think you can start experimenting with it without any change to
>>> the core data structures. In your single-page web site whose sole
>>> html file embeds an mpeg movie, you keep track of these two things in
>>> git:
>>>
>>>     porn-of-the-day.html
>>>     porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>>>
>>> and any time you want to feed a new movie, you update the symlink to
>>> a different one that lives outside the source-controlled tree, while
>>> arranging for the link target to be updated out-of-band.

It seems like the main problem here would be that most operations in the
working directory would overwrite not the symlink but the file it points
to. If you do a simple 'cp ~/generated_file.mpg porn-of-the-day.mpg' (to
upload your newest and bestest porn), it will overwrite the
'../media/6066f5ae75ec.mpg' file, not the symlink, so we never get the
chance to generate a new symlink. Then, if we haven't uploaded the
'../media/6066f5ae75ec.mpg' file anywhere yet, it's a goner. Right?
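The overwrite hazard described above is easy to reproduce with a short
shell sketch; the directory layout and file names here are made up for
illustration, not taken from any real repository:

```shell
#!/bin/sh
# Sketch of the out-of-tree symlink scheme, and the hazard that a plain
# cp(1) writes *through* the symlink to its target. All paths invented.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir repo media
echo "old movie bytes" > media/6066f5ae75ec.mpg
ln -s ../media/6066f5ae75ec.mpg repo/movie.mpg   # the tracked symlink

# Git would record only the link target (a tiny blob); the media file
# itself never enters the object database.

# The hazard: cp dereferences an existing destination symlink, so the
# out-of-tree target is clobbered and the link itself never changes.
echo "new movie bytes" > generated.mpg
cp generated.mpg repo/movie.mpg

cat media/6066f5ae75ec.mpg   # the old payload is gone
```

If the old payload had not yet been uploaded anywhere, it is
unrecoverable at this point, which is exactly the "it's a goner" case.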
What you are proposing is almost exactly what I want to do, but I'm
concerned that the symlink reference will not work right for normal
working-directory operations like this. If the file is never overwritten,
however, this is basically identical to what I wanted to do.

Scott

>> that would work, but the proposed change has some advantages
>>
>> 1. you store the sha1 of the real mpg in the 'large file' blob so you
>> can detect problems
>
> You store the unique identifier of the real mpg in the symbolic link
> target, which is a blob payload, so you can detect problems already. I
> deliberately said "unique identifier"; you seem to think saying SHA-1
> brings something magical, but I do not think it even needs to be the
> blob's SHA-1. Hashing that much data costs.
>
> In any case, you can have a script (or client-side hook) that does:
>
>  (1) find the out-of-tree symlinks in the index (or in the work tree);
>
>  (2) if one is dangling, and if you have a definition of where to get
>      that hierarchy from (e.g. ../media), run rsync or wget or whatever
>      external means to grab it;
>
> and call it after "git pull" updates from some other place. The "git
> media" of Scott's message could be an alias to such a command.
>
> Adding a new type "external-blob" would be an unwelcome pain. Reusing
> "blob" so that the existing "blob" codepath now needs to notice a
> special "0" that is not length "0" is an even bigger pain than that.
>
> And that is a pain for unknown benefit, especially when you can start
> experimenting without any changes to the existing data structure. In
> the worst case, the experiment may not pan out as well as you hoped,
> and if that is the end of the story, so be it. It is not a great loss.
> If it works well enough and we can have external large-media support
> without any changes to the data structure, that would be really great.
> If it sort-of works but hits a limitation, we can analyze how best to
> overcome that limitation, and at that time it _might_ turn out that the
> best approach is to introduce a new blob type.
>
> But I do not think we know that yet.
>
> In the longer run, as you speculated in your message, I think the
> native blob codepaths need to be updated to tolerate large, unmappable
> objects better. With that goal in mind, I think it is a huge mistake to
> prematurely introduce arbitrarily distinct "blob" and "large blob"
> types if in the end they need to be merged back again; it would force
> the future code to care indefinitely about the historical "large blob"
> type that was once supported.
>
>> 2. since it knows the sha1 of the real file, it can auto-create the
>> real file as needed, without wasting space on too many copies of it.
>
> Hmm, since when is SHA-1 reversible?
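The two-step helper Junio outlines, find the out-of-tree symlinks, then
fetch any dangling ones from a known location, could look roughly like
the sketch below. A local "remote-media" directory stands in for the
rsync/wget step, and every path and name is illustrative:

```shell
#!/bin/sh
# Sketch of a hypothetical "git media" fetch helper. The real thing
# would run over a live repo and use rsync or wget in step (2); here a
# local directory simulates the remote so the sketch is self-contained.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir repo media remote-media
echo "movie payload" > remote-media/6066f5ae75ec.mpg
ln -s ../media/6066f5ae75ec.mpg repo/movie.mpg   # dangling: media/ empty

# (1) find symlinks in the work tree; with a real repo this could
#     instead read `git ls-files -s` and filter for mode 120000
for link in $(find repo -type l); do
    target=$(readlink "$link")
    case "$target" in
    ../media/*)
        # (2) if the link is dangling, grab the payload from the
        #     configured source (stand-in for rsync/wget)
        if [ ! -e "repo/$target" ]; then
            name=${target#../media/}
            cp "remote-media/$name" "media/$name"
        fi
        ;;
    esac
done

cat repo/movie.mpg   # the symlink resolves now
```

Running something like this after "git pull" would repopulate the
../media hierarchy on demand, without any change to git's data
structures.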