Lars Schneider <larsxschneider@xxxxxxxxx> writes:

> Some applications have test data, image assets, and other data sets that
> need to be versioned along with the source code.
>
> How would you deal with these kind of "huge objects" _today_?

When you know that you'd find the answer to that question totally
uninteresting, why do you even bother to ask? ;-)  I don't, and if I
had to, I would deal with them just like any other objects.

A more interesting pair of questions to ask would be what the
fundamental requirement for an acceptable solution is, and what
solution within that constraint I would envision, if I were given a
group of competent Git hackers and enough time to realize it.

The most important constraint is that any acceptable solution should
preserve the object identity.  Starting from an "I don't, but if I
had to..." repository that is created in a dumb way, a solution that
satisfies the constraint may work like this, requiring enhancements
to various parts of the system:

 - The "upload-pack" protocol would allow the owner of such a
   repository and the party that "git clone"'s from there to
   negotiate:

   . what it means for an object to be "huge" (e.g. the owner may
     implicitly show the preference by marking a packfile as
     containing such "huge" objects, may configure that blobs that
     appear at paths matching certain glob patterns are "huge", or
     the sender and the receiver may agree that objects larger than
     X MB are "huge", etc.); and

   . what to do with "huge" objects (e.g. the receiver may ask for a
     full clone, or the receiver may ask to omit "huge" ones from
     the initial transfer).

 - The "upload-pack" protocol would give, in addition to the normal
   pack stream that conveys only non-"huge" objects, for each "huge"
   object that is not transferred, what its object name is and how
   it can later be retrieved.

 - Just like packing objects in packfiles was added as a different
   implementation to store objects in the object database that is
   better than storing them individually as loose object files,
   there will be a third way to store such "huge" objects _in_ the
   object database, which may actually not _store_ them locally at
   all.  The local object store may merely have placeholders for
   them, in which instructions for how they can be acquired when
   necessary are stored.  The extra information sent over the
   "upload-pack" protocol for "huge" objects in the previous bullet
   point is used to store these objects in this "third" way.

 - A new mechanism would allow objects that are stored in this
   "third" way to be retrieved lazily or on-demand.

There are other enhancements whose necessity will fall naturally out
of the lazy scheme outlined above.  E.g. "fsck" needs to learn that
objects stored in the third way are considered to "exist" but their
actual contents are not expected to be verifiable until they are
retrieved.  "send-pack" (i.e. running "git push" from a repository
cloned with the procedure outlined above) needs to treat objects
stored in the third way differently (most likely, it will fail a
request for a full clone and send "not here, but you can get it this
way" for them).  Local operations that need more than object names
need to learn reasonable fallback behaviours to work when the actual
object contents are not yet available (e.g. all of them may offer
"this is not yet available; do you want to get it on-demand?", or
there may even be an "object.ondemand" configuration option to skip
the end-user interaction.  When on-demand retrieval is not done,
"git archive" may place a placeholder file in its output that says
"no data (yet) here", "git log --raw" may show the object name but
"git log -p" may say "insufficient data to produce a patch", etc.)
[*1*].
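To make the outline above a little more concrete, the owner-side
preference and the clone-time negotiation might end up looking
something like the sketch below.  None of these configuration keys
or command line options exist today; they are purely illustrative
placeholders for whatever the negotiation ends up being:

    # In the serving repository: a hypothetical way for the owner to
    # declare what "huge" means there, by path pattern and/or size.
    [hugeObject]
            pathPattern = assets/**
            sizeLimit = 50m

    # On the cloning side: a hypothetical option asking the server
    # to omit "huge" objects from the initial transfer and to send
    # only their names plus retrieval instructions instead.
    $ git clone --omit-huge git://example.com/project.git

For each object omitted this way, the placeholder kept in the local
object store (the "third" way above) would record at least the
object name and a hint of where the contents can later be fetched
from.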
Because we start from "the object identity should not change", you
do not have to make a decision upfront when preparing the ultimate
source of truth.  When you take a clone-network of a single project
as a whole, somebody needs to hold the entire set of objects
somewhere, and many of the repositories in the clone-network may
have "huge" objects in the third, "not here yet, here is how to get
it" form.  As the system improves, and as networking and storage
technology changes, the definition of "huge" WILL change over time,
and those repositories can turn objects that used to be "huge" into
normal ones.

If you use the approaches taken by the current crop of clean/smudge
based solutions [*2*], on the other hand, once you decide a blob
object is "huge" and needs to be replaced with a surrogate (to be
instantiated via the "clean" filter), the "huge" object _has_ to
stay in the surrogate form in the containing tree, and you cannot
ever change the division between "huge" and "normal" without
rewriting the history.

[Footnote]

*1* Astute readers would realize that the utility of such a "third
    way" object storage mechanism is not limited to "keep and
    transfer huge objects lazily".  The same mechanism can say "not
    yet here, and there is no way for _you_ to retrieve the
    contents", which is an effective way to "obliterate" an object.

*2* I called them "hacks" because they are practical compromises
    that can be made with today's Git, while sidestepping the harder
    problems that need to be solved to realize the solution outlined
    above.
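For concreteness, the clean/smudge arrangement that *2* refers to is
built on the existing filter attribute machinery, roughly along the
following lines.  The "huge" filter name and the "huge-store" helper
command are made up for illustration; the point is only to show
where the surrogate comes from:

    # .gitattributes: route matching paths through a filter.
    assets/**       filter=huge

    # .git/config: the "clean" side replaces the real content with a
    # small surrogate ("pointer") blob when the path is added, and
    # the "smudge" side turns the surrogate back into the real
    # content at checkout time ("%f" is the path being filtered).
    [filter "huge"]
            clean = huge-store store %f
            smudge = huge-store fetch %f

Because it is the surrogate blob, not the real content, that gets
recorded in the tree objects, the division between "huge" and
"normal" is frozen into the resulting history, which is exactly the
limitation described above.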