Jeff King <peff@xxxxxxxx> writes:

> [1] One thing I've been toying with is "external alternates"; dumping
>     your large objects in some relatively slow data store (e.g., a
>     RESTful HTTP service). You could cache and cheaply query a list
>     of "sha1 / size / type" for each object from the store, but
>     getting the actual objects would be much more expensive. But
>     again, it would depend on whether you would actually have such a
>     store directly accessible by a ref.

Yeah, that actually has been another thing we were discussing locally,
without coming to something concrete enough to present to the list.

The basic idea is to mark such paths with attributes, and use a variant
of the smudge/clean filter that is _not_ a filter (as we do not want
the interface to this external helper to be "we feed the whole big blob
to you"). Instead, these smudgex/cleanx things work on a pathname.

 - Your in-tree objects store a blob that records a description of the
   large thing. Call such a blob a surrogate. "clone", "fetch" and
   "push" all deal only with surrogates, so your in-history data will
   stay small.

 - When checking out, the attributes mechanism kicks in and runs the
   "not filter" variant of smudge with the data in the surrogate. The
   surrogate records how to get the real thing from where, and how to
   validate that what you got is correct. A hand-wavy example may look
   like this:

       get: download http://cdn.example.com/67def20
       sha1sum: f84667def209e4a84e37e8488a08e9eca3f208c1

   to tell you to download a single URL with whatever means suitable
   for your platform (perhaps curl or wget), and verify the result by
   running sha1sum. Or it may involve

       get: git-fetch git://git.example.com/images.git/ master
       object: 85a094f22f02c54c740448f6716da608a5e89a80

   to tell you to "git fetch" from the given git-reachable resource
   into some place and grab the object via "git cat-file", possibly
   streaming it out. The details do not matter at this point in the
   design process.

   The smudgex helper is responsible for caching previously fetched
   large contents and maintaining the association between the
   surrogate blob and its real data, so that once the real thing is
   downloaded, and the contents of the path change to something else
   (e.g. the user checks out a different branch) and then change back
   to the previous thing (e.g. the user comes back to the original
   branch), it does not download it again.

 - When checking if the working tree is clean relative to the index,
   the smudgex/cleanx helper will be consulted. It will be given the
   surrogate data in the index and the path in the working tree. We
   may want to allow the helper implementation to give a read-only
   hardlink directly into the helper's cache storage, so that it can
   consult its database of surrogate-to-real mapping and perform this
   verification cheaply by inode comparison, or something.

 - When running "git add" on modified large content prepared in the
   working tree, the cleanx helper is called to prepare a new
   surrogate, and that is what is registered in the index. The helper
   is also responsible for storing the new large content away and
   arranging for it to be retrievable when others see and use this
   surrogate.

The initial scope of supporting something like that in core-git would
be to add the necessary infrastructure to arrange for such smudgex and
cleanx helpers to be called when a path is marked as a surrogate in
the attribute system, and to supply a sample helper. Rough sketches of
what such pieces might look like follow.
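To make the marking concrete: purely for illustration (the attribute
name "surrogate" and the configuration keys below are invented for
this sketch, not an existing interface), paths and helpers could be
wired up like so:

    $ cat .gitattributes
    *.iso    surrogate
    images/* surrogate

    $ git config surrogate.smudgex /usr/local/bin/surrogate-smudge
    $ git config surrogate.cleanx  /usr/local/bin/surrogate-clean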
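A sample smudgex helper for the "download" form of the surrogate could
be a small script along these lines; this assumes, again only for the
sketch, that the helper is given the working tree path as its argument
and the surrogate contents on its standard input, and that it keeps a
cache directory of its own choosing:

    #!/bin/sh
    # surrogate-smudge (sketch): materialize the real content described
    # by a "get: download" surrogate at the path given as $1.
    path=$1
    cache=${SURROGATE_CACHE:-$HOME/.cache/git-surrogate}

    surrogate=$(cat)
    url=$(printf '%s\n' "$surrogate" | sed -n 's/^get: download //p')
    expect=$(printf '%s\n' "$surrogate" | sed -n 's/^sha1sum: //p')

    mkdir -p "$cache"
    blob=$cache/$expect

    # Download only when verified content is not already in the cache.
    if ! test -f "$blob"
    then
        curl -sSf -o "$blob.tmp" "$url" || exit 1
        actual=$(sha1sum "$blob.tmp" | cut -d' ' -f1)
        if test "$actual" != "$expect"
        then
            rm -f "$blob.tmp"
            echo >&2 "surrogate-smudge: checksum mismatch for $url"
            exit 1
        fi
        mv "$blob.tmp" "$blob"
    fi

    # Hand the content to the working tree; a read-only hardlink into
    # the cache keeps the inode trick mentioned above available.
    chmod a-w "$blob"
    ln -f "$blob" "$path" 2>/dev/null || cp "$blob" "$path"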
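The "git-fetch" form could be handled by the same helper with a bare
cache repository; again just a sketch with a made-up cache layout:

    #!/bin/sh
    # Fragment for "get: git-fetch <repo> <ref>" plus "object: <sha1>"
    # surrogates: fetch into a bare cache repository and stream the
    # blob out with "git cat-file".
    path=$1
    repo=${SURROGATE_CACHE:-$HOME/.cache/git-surrogate}/objects.git

    surrogate=$(cat)
    object=$(printf '%s\n' "$surrogate" | sed -n 's/^object: //p')
    set -- $(printf '%s\n' "$surrogate" | sed -n 's/^get: git-fetch //p')
    url=$1 ref=$2

    test -d "$repo" || git init --bare --quiet "$repo"

    # Fetch from upstream only when the object is not already here
    # (a real helper would also keep a ref so the object survives gc).
    if ! git --git-dir="$repo" cat-file -e "$object" 2>/dev/null
    then
        git --git-dir="$repo" fetch --quiet "$url" "$ref" || exit 1
    fi

    # Stream the blob contents into the working tree path.
    git --git-dir="$repo" cat-file blob "$object" >"$path"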
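The cheap cleanliness check could, for a helper that hands out
read-only hardlinks as above, boil down to an inode comparison; note
that the "-ef" test is a common shell extension rather than strict
POSIX, so this is only a sketch of the idea:

    #!/bin/sh
    # Fragment (sketch): given the working tree path ($1) and the
    # surrogate on stdin, report whether the content still matches.
    path=$1
    cache=${SURROGATE_CACHE:-$HOME/.cache/git-surrogate}

    expect=$(sed -n 's/^sha1sum: //p')

    # If the file is still the very same inode as the read-only cached
    # copy linked at checkout time, it cannot have been modified.
    if test "$path" -ef "$cache/$expect"
    then
        exit 0          # clean
    fi

    # Otherwise fall back to recomputing the checksum.
    actual=$(sha1sum "$path" | cut -d' ' -f1)
    test "$actual" = "$expect"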
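And a cleanx helper for "git add" could, in the simplest case,
checksum the new content, stash it in its cache, and print a fresh
surrogate to be registered in the index; where and how the content is
actually published (the base URL below is a placeholder) is the
helper's business and is left out of this sketch:

    #!/bin/sh
    # surrogate-clean (sketch): store away the large content at $1 and
    # emit the surrogate that will be registered in the index.
    path=$1
    cache=${SURROGATE_CACHE:-$HOME/.cache/git-surrogate}
    base=${SURROGATE_URL:-http://cdn.example.com}   # hypothetical store

    mkdir -p "$cache"
    sum=$(sha1sum "$path" | cut -d' ' -f1)

    # Keep a local copy; uploading it to the store so that others can
    # resolve the surrogate is the real helper's job.
    test -f "$cache/$sum" || cp "$path" "$cache/$sum"

    printf 'get: download %s/%s\n' "$base" "$sum"
    printf 'sha1sum: %s\n' "$sum"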