Hey,

On Thu, Mar 19, 2009 at 5:11 PM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> david@xxxxxxx writes:
>
>> On Thu, 19 Mar 2009, Junio C Hamano wrote:
>>
>>> Scott Chacon <schacon@xxxxxxxxx> writes:
>>>
>>>> The point is that we don't keep this data as 'blob's - we don't try
>>>> to compress them or add the header to them; they're too big and
>>>> already compressed, it's a waste of time, and it's often outside the
>>>> memory tolerance of many systems. We keep only the stub in our db
>>>> and stream the large media content directly to and from disk. If we
>>>> do a 'checkout' or something that would switch it out, we could
>>>> store the data in '.git/media' or the equivalent until it's uploaded
>>>> elsewhere.
>>>
>>> Aha, that sounds like you can just maintain a set of out-of-tree
>>> symbolic links that you keep track of, and let other people
>>> (e.g. rsync) deal with the complexity of managing that side of the
>>> world.
>>>
>>> And I think you can start experimenting with it without any change to
>>> the core data structures. In your single-page web site whose sole
>>> html file embeds an mpeg movie, you keep track of these two things in
>>> git:
>>>
>>>     porn-of-the-day.html
>>>     porn-of-the-day.mpg -> ../media/6066f5ae75ec.mpg
>>>
>>> and any time you want to feed a new movie, you update the symlink to
>>> a different one that lives outside the source-controlled tree, while
>>> arranging for the link target to be updated out-of-band.

It seems like the main problem here would be that most operations in the
working directory would overwrite not the symlink but the file it points
to. If you do a simple 'cp ~/generated_file.mpg porn-of-the-day.mpg' (to
upload your newest and bestest porn), it will overwrite the
'../media/6066f5ae75ec.mpg' file, not the symlink, so we never get the
chance to generate a new symlink. Then, if we haven't uploaded the
'../media/6066f5ae75ec.mpg' file anywhere yet, it's a goner. Right?
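The overwrite hazard described above is easy to reproduce with a short
shell sketch; the directory layout and file names here are made up for
illustration, not taken from any real repository:

```shell
#!/bin/sh
# Sketch of the out-of-tree symlink scheme, and the hazard that a plain
# cp(1) writes *through* the symlink to its target. All paths invented.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir repo media
echo "old movie bytes" > media/6066f5ae75ec.mpg
ln -s ../media/6066f5ae75ec.mpg repo/movie.mpg   # the tracked symlink

# Git would record only the link target (a tiny blob); the media file
# itself never enters the object database.

# The hazard: cp dereferences an existing destination symlink, so the
# out-of-tree target is clobbered and the link itself never changes.
echo "new movie bytes" > generated.mpg
cp generated.mpg repo/movie.mpg

cat media/6066f5ae75ec.mpg   # the old payload is gone
```

If the old payload had not yet been uploaded anywhere, it is
unrecoverable at this point, which is exactly the "it's a goner" case.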
What you are proposing is almost exactly what I want to do, but I'm
concerned that the symlink reference will not work right for normal
working-directory operations like this. If the file is never overwritten,
however, this is basically identical to what I wanted to do.

Scott

>> that would work, but the proposed change has some advantages
>>
>> 1. you store the sha1 of the real mpg in the 'large file' blob so you
>> can detect problems
>
> You store the unique identifier of the real mpg in the symbolic link
> target, which is a blob payload, so you can detect problems already. I
> deliberately said "unique identifier"; you seem to think saying SHA-1
> brings something magical, but I do not think it even needs to be the
> blob's SHA-1. Hashing that much data costs.
>
> In any case, you can have a script (or client-side hook) that does:
>
>  (1) find the out-of-tree symlinks in the index (or in the work tree);
>
>  (2) if one is dangling, and if you have a definition of where to get
>      that hierarchy from (e.g. ../media), run rsync or wget or whatever
>      external means to grab it;
>
> and call it after "git pull" updates from some other place. The "git
> media" of Scott's message could be an alias to such a command.
>
> Adding a new type "external-blob" would be an unwelcome pain. Reusing
> "blob" so that the existing "blob" codepath now needs to notice a
> special "0" that is not length "0" is an even bigger pain than that.
>
> And that is a pain for unknown benefit, especially when you can start
> experimenting without any changes to the existing data structure. In
> the worst case, the experiment may not pan out as well as you hoped,
> and if that is the end of the story, so be it. It is not a great loss.
> If it works well enough and we can have external large-media support
> without any changes to the data structure, that would be really great.
> If it sort-of works but hits a limitation, we can analyze how best to
> overcome that limitation, and at that time it _might_ turn out that the
> best approach is to introduce a new blob type.
>
> But I do not think we know that yet.
>
> In the longer run, as you speculated in your message, I think the
> native blob codepaths need to be updated to tolerate large, unmappable
> objects better. With that goal in mind, I think it is a huge mistake to
> prematurely introduce arbitrarily distinct "blob" and "large blob"
> types if in the end they need to be merged back again; it would force
> the future code to care indefinitely about the historical "large blob"
> type that was once supported.
>
>> 2. since it knows the sha1 of the real file, it can auto-create the
>> real file as needed, without wasting space on too many copies of it.
>
> Hmm, since when is SHA-1 reversible?
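The two-step helper Junio outlines, find the out-of-tree symlinks, then
fetch any dangling ones from a known location, could look roughly like
the sketch below. A local "remote-media" directory stands in for the
rsync/wget step, and every path and name is illustrative:

```shell
#!/bin/sh
# Sketch of a hypothetical "git media" fetch helper. The real thing
# would run over a live repo and use rsync or wget in step (2); here a
# local directory simulates the remote so the sketch is self-contained.
set -e
tmp=$(mktemp -d)
cd "$tmp"
mkdir repo media remote-media
echo "movie payload" > remote-media/6066f5ae75ec.mpg
ln -s ../media/6066f5ae75ec.mpg repo/movie.mpg   # dangling: media/ empty

# (1) find symlinks in the work tree; with a real repo this could
#     instead read `git ls-files -s` and filter for mode 120000
for link in $(find repo -type l); do
    target=$(readlink "$link")
    case "$target" in
    ../media/*)
        # (2) if the link is dangling, grab the payload from the
        #     configured source (stand-in for rsync/wget)
        if [ ! -e "repo/$target" ]; then
            name=${target#../media/}
            cp "remote-media/$name" "media/$name"
        fi
        ;;
    esac
done

cat repo/movie.mpg   # the symlink resolves now
```

Running something like this after "git pull" would repopulate the
../media hierarchy on demand, without any change to git's data
structures.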