Hi everyone, On Sat, Nov 05, 2016 at 12:04:08AM +0100, Christian Couder wrote: > On Fri, Nov 4, 2016 at 10:19 PM, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote: > > On Fri, Nov 04, 2016 at 09:47:41PM +0100, Christian Couder wrote: > >> > >> Couldn't a RefTree be used to store refs that point to the base > >> commit, the tip commit and the blob that contains the cover letter, > >> and maybe also a ref pointing to the RefTree of the previous version > >> of the series? > > > > That's really interesting! The Software Heritage project is working on > > something similar, because they want to store all the refs as part of > > their data model as well. I'll point them to the reftree work. > > Yeah, I know them :-) and I think I have already told Stefano > Zacchiroli about this, but I am not sure anymore. > Anyway I am CC'ing him. Thanks Christian (and Josh, on swh-devel) for pointing me to this. As a bit of background, the conceptual data model we have adopted for Software Heritage [1] is indeed that of a global Merkle DAG, very much inspired by Git, but where we deduplicate past the boundaries of individual VCS repositories. This way we can store only once the same software artifacts (blobs, trees, commits, etc.) even when they can be found at different software origins [2] (be it due to GitHub-like forks, projects moving around, or simply rogue copies of the same code scattered around the Internet). [1]: https://www.softwareheritage.org/ [2]: "software origin" is Software Heritage terminology, which just stands for places on the Internet where we can find source code In our original design the topmost entries in our Merkle hierarchy used to be commits and tags, similar to what Git does. But then we realized that doing so inhibited us from sharing entire repository states across multiple software origins or multiple visits of the same software origin. So we decided to add "repository snapshot objects" as our topmost entries, which are essentially git-like objects that map refs to the ID of the corresponding (typed-)objects. Rationale and a more lengthy description of this is available on our wiki [3]. It is not implemented yet, but we're pretty sold on the design at this point. [3]: https://wiki.softwareheritage.org/index.php?title=Repository_snapshot_objects Now, even if my only awareness of what's going on in Git upstream is limited to sporadic chats with Josh and Christian :-), it seems to me that various ideas in the Git ecosystem go in the same direction of our snapshot objects (git-series, RefTree). Which is understandable, given a number of use cases might be served by this. I don't think we have much to contribute to discussion or implementation here, and for our needs it doesn't really matter which one gets implemented. That's because we need an implementation of the concept which is *external* to Git anyhow. But even if it happens to exist within actual VCS, it's not a big deal for us, as we do have ways to distinguish "synthetic" objects in the DAG that we create for our own needs from "real" objects coming from actual software origins. (Another example of this concept we already have is when we inject distribution source packages or tarballs in our archive. In that case we create synthetic commits that points to the tree extracted from the tarball/package, preserving the ability to distinguish them from real commits coming from VCS out there.) If you think we can help in any other way, other than sharing our experiences and design considerations that is, please let me know! (I'm not subscribed to the Git upstream mailing list, but feel free to Cc:-me in conversations related to this topic.) Cheers. -- Stefano Zacchiroli . zack@xxxxxxxxxx . upsilon.cc/zack . . o . . . o . o Computer Science Professor . CTO Software Heritage . . . . . o . . . o o Former Debian Project Leader . OSI Board Director . . . o o o . . . o . « the first rule of tautology club is the first rule of tautology club »