On Mon, 30 May 2011, Jeff King <peff@xxxxxxxx> wrote:
> On Sat, May 28, 2011 at 02:17:38PM +0200, Jakub Narebski wrote:
>
> > Among the covered programs is Mercurial (chapter by Dirkjan Ochtman)...
> > but unfortunately no Git (they probably thought that one DVCS is enough).
> >
> > What would such a chapter on Git look like? The authors of this book
> > encourage (among other things) writing new chapters.
>
> I just skimmed the Mercurial chapter, but they do cover a fair bit of
> general DVCS architecture. For git, I would guess a good approach would
> be to describe the data structures (i.e., content-addressable object
> database, DAG of commits, refs storing branches and tags), as everything
> else falls out from there. Most of the basic commands can be explained
> as "do some simple operation to the history graph or object db" and the
> more complex commands are compositions of the simple ones. So the
> architecture is really about having a data structure that represents the
> problem, exposing it to the user, and then building some niceties around
> the basic data structure operations.

The repository model that Git uses is quite well described in "Pro Git",
in the "Discussion" section of the git(1) manpage, in the "Git concepts"
section of the Git User's Manual, and in gitcore-tutorial(7). What I am
more interested in are the design *goals*, i.e. what was behind choosing
this architecture and not another.

The chapter on Mercurial, in the '12.2. Data Structures > 12.2.1.
Challenges' subsection, lists the limiting technology factors
(quoting [Mac06]):

 * speed: CPU
 * capacity: disk and memory
 * bandwidth: memory, LAN, disk, and WAN
 * disk seek rate

This was for Mercurial. From what I remember from the KernelTrap
articles, which covered the beginnings of Git development quite well,
and from other sources, the main limiting factor considered was
__speed__, not disk space. At first Git had only the 'loose' object
format -- do you remember the "disk space is cheap" comment by Linus?
Admittedly Git used zlib compression from the very beginning (which
works well for text). IIRC, when the repository _model_ that Git uses
was being drafted, LAN/WAN bandwidth was not a consideration; AFAIK the
first transport Git used was the nowadays deprecated rsync:// (the UNIX
philosophy of prototyping and developing with existing, ready-made
tools, see [TAOUP], [Ben86]). I think it was assumed that the operating
system would be good enough that seek rates did not need to be worried
about: Git is optimized for the "hot cache" case. Note, however, that
the later adoption of the 'packed' format as an on-disk format was
driven by speed (disk seek rate) as well as by disk capacity, i.e.
reducing repository size. Well, at least that is how I remember it.
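As an aside: the 'loose' format is simple enough that the whole
content-addressable store can be sketched in a few lines. The snippet
below is purely illustrative Python (git itself is C and shell; the
function name and the default directory here are made up), but it shows
the idea: an object is named by the SHA-1 of "type size\0content", and
that same byte string is zlib-deflated on disk.

  import hashlib
  import os
  import zlib

  def write_loose_blob(content, objects_dir="objects"):
      # A blob is stored as "blob <size>\0<content>"; its name is the
      # SHA-1 of exactly those bytes.  Real git keeps them under
      # .git/objects/<2 hex digits>/<remaining 38 hex digits>.
      store = b"blob " + str(len(content)).encode() + b"\0" + content
      sha = hashlib.sha1(store).hexdigest()
      path = os.path.join(objects_dir, sha[:2], sha[2:])
      os.makedirs(os.path.dirname(path), exist_ok=True)
      with open(path, "wb") as f:
          f.write(zlib.compress(store))
      return sha

The returned id should agree with what `git hash-object` computes for
the same content; the point is only that the object database takes a
very small amount of code, which fits the "speed first, disk space is
cheap" attitude above.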
The Mercurial chapter's '12.2.1. Challenges' subsection continues:

  The paper [i.e. [Mac06]] goes on to review common scenarios or
  criteria for evaluating the performance of such a system at the file
  level:

  * Storage compression: what kind of compression is best suited to
    save the file history on disk? Effectively, what algorithm makes
    the most out of the I/O performance while preventing CPU time from
    becoming a bottleneck?

  * Retrieving arbitrary file revisions: a number of version control
    systems will store a given revision in such a way that a large
    number of older revisions must be read to reconstruct the newer
    one (using deltas). We want to control this to make sure that
    retrieving old revisions is still fast.

  * Adding file revisions: we regularly add new revisions. We don't
    want to rewrite old revisions every time we add a new one, because
    that would become too slow when there are many revisions.

  * Showing file history: we want to be able to review a history of
    all changesets that touched a certain file. This also allows us to
    do annotations (which used to be called `blame` in CVS but was
    renamed to `annotate` in some later systems to remove the negative
    connotation): reviewing the originating changeset for each line
    currently in a file.

From what *I* understand, Linus approached the problem of DVCS design
from a different direction: he is a maintainer rather than an ordinary
developer, and (from what he has said) a filesystem designer at heart,
not a version control developer. Thus the common scenarios or criteria
were different:

 * merging and applying patches
 * showing _subsystem_ history
 * ???

That is what I am interested in. Some of Git's history, and I think
some of the motivations behind its design, can be found in the
"Git Chronicle" slides by Junio from GitTogether.

> Of course that's just my perspective. Linus might have written
> something totally different. :)

Well, only Linus can be the definitive source on the initial *design
goals* (behind the core design of Git)...

References:
~~~~~~~~~~~
[Mac06]: Matt Mackall, "Towards a Better SCM: Revlog and Mercurial",
  2006 Ottawa Linux Symposium, 2006.
  http://selenic.com/mercurial/wiki/index.cgi/Presentations?action=AttachFile&do=get&target=ols-mercurial-paper.pdf
  (see also http://mercurial.selenic.com/wiki/Presentations)

[TAOUP]: Eric Raymond, "The Art of Unix Programming", 2003.
  http://www.faqs.org/docs/artu/
  http://www.catb.org/~esr/writings/taoup/

[Ben86]: Jon Bentley, "Programming Pearls"; the chapter on implementing
  and prototyping the UNIX 'spell' program (read in the Polish
  translation).

--
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html