On Mon, 30 May 2011, Jeff King <peff@xxxxxxxx> wrote:
> On Sat, May 28, 2011 at 02:17:38PM +0200, Jakub Narebski wrote:
>
> > Among the covered programs is Mercurial (chapter by Dirkjan Ochtman)...
> > but unfortunately no Git (they probably thought that one DVCS is enough).
> >
> > What would such a chapter on Git look like? The authors of this book
> > encourage (among other things) writing new chapters.
>
> I just skimmed the Mercurial chapter, but they do cover a fair bit of
> general DVCS architecture. For git, I would guess a good approach would
> be to describe the data structures (i.e., content-addressable object
> database, DAG of commits, refs storing branches and tags), as everything
> else falls out from there. Most of the basic commands can be explained
> as "do some simple operation to the history graph or object db" and the
> more complex commands are compositions of the simple ones. So the
> architecture is really about having a data structure that represents the
> problem, exposing it to the user, and then building some niceties around
> the basic data structure operations.

The repository model that Git uses is quite well described in "Pro Git",
in the "Discussion" section of the git(1) manpage, in the "Git concepts"
section of the Git User's Manual, and in gitcore-tutorial(7). What I am
more interested in are the design *goals*, i.e. what was behind choosing
this architecture and not another.

The chapter on Mercurial, in the '12.2. Data Structures > 12.2.1.
Challenges' subsection, lists the limiting technology factors
(quoting [Mac06]):

 * speed: CPU
 * capacity: disk and memory
 * bandwidth: memory, LAN, disk, and WAN
 * disk seek rate

This was for Mercurial. From what I remember from the KernelTrap
articles, which covered the beginnings of Git development quite well,
and from other sources, the main limiting factor considered was
__speed__, not disk space. At first Git had only the 'loose' object
format -- do you remember the "disk space is cheap" comment by Linus?
Admittedly Git used zlib compression from the very beginning (which
works well for text). IIRC, when the repository _model_ that Git uses
was being drafted, LAN/WAN bandwidth was not a consideration; AFAIK the
first transport Git used was the nowadays deprecated rsync:// (the UNIX
philosophy of prototyping and developing with existing, ready-made
tools, see [TAOUP], [Ben86]). I think it was assumed that the operating
system would be good enough that seek rates did not need to be worried
about: Git is optimized for the "hot cache" case. Note, however, that
the later adoption of the 'packed' format as an on-disk format was
driven by speed (disk seek rate) as well as by disk capacity, i.e.
reducing repository size. Well, at least that is how I remember it.
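As an aside: the 'loose' format is simple enough that the whole
content-addressable store can be sketched in a few lines. The snippet
below is purely illustrative Python (git itself is C and shell; the
function name and the default directory here are made up), but it shows
the idea: an object is named by the SHA-1 of "type size\0content", and
that same byte string is zlib-deflated on disk.

  import hashlib
  import os
  import zlib

  def write_loose_blob(content, objects_dir="objects"):
      # A blob is stored as "blob <size>\0<content>"; its name is the
      # SHA-1 of exactly those bytes.  Real git keeps them under
      # .git/objects/<2 hex digits>/<remaining 38 hex digits>.
      store = b"blob " + str(len(content)).encode() + b"\0" + content
      sha = hashlib.sha1(store).hexdigest()
      path = os.path.join(objects_dir, sha[:2], sha[2:])
      os.makedirs(os.path.dirname(path), exist_ok=True)
      with open(path, "wb") as f:
          f.write(zlib.compress(store))
      return sha

The returned id should agree with what `git hash-object` computes for
the same content; the point is only that the object database takes a
very small amount of code, which fits the "speed first, disk space is
cheap" attitude above.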
The Mercurial chapter's '12.2.1. Challenges' subsection continues:

  The paper [i.e. [Mac06]] goes on to review common scenarios or
  criteria for evaluating the performance of such a system at the file
  level:

  * Storage compression: what kind of compression is best suited to
    save the file history on disk? Effectively, what algorithm makes
    the most out of the I/O performance while preventing CPU time from
    becoming a bottleneck?

  * Retrieving arbitrary file revisions: a number of version control
    systems will store a given revision in such a way that a large
    number of older revisions must be read to reconstruct the newer
    one (using deltas). We want to control this to make sure that
    retrieving old revisions is still fast.

  * Adding file revisions: we regularly add new revisions. We don't
    want to rewrite old revisions every time we add a new one, because
    that would become too slow when there are many revisions.

  * Showing file history: we want to be able to review a history of
    all changesets that touched a certain file. This also allows us to
    do annotations (which used to be called `blame` in CVS but was
    renamed to `annotate` in some later systems to remove the negative
    connotation): reviewing the originating changeset for each line
    currently in a file.

From what *I* understand, Linus approached the problem of DVCS design
from a different direction: he is a maintainer rather than an ordinary
developer, and (from what he has said) a filesystem designer at heart,
not a version control developer. Thus the common scenarios or criteria
were different:

 * merging and applying patches
 * showing _subsystem_ history
 * ???

That is what I am interested in. Some of Git's history, and I think
some of the motivations behind its design, can be found in the
"Git Chronicle" slides by Junio from GitTogether.

> Of course that's just my perspective. Linus might have written
> something totally different. :)

Well, only Linus can be the definitive source on the initial *design
goals* (behind the core design of Git)...

References:
~~~~~~~~~~~
[Mac06]: Matt Mackall, "Towards a Better SCM: Revlog and Mercurial",
  2006 Ottawa Linux Symposium, 2006.
  http://selenic.com/mercurial/wiki/index.cgi/Presentations?action=AttachFile&do=get&target=ols-mercurial-paper.pdf
  (see also http://mercurial.selenic.com/wiki/Presentations)

[TAOUP]: Eric Raymond, "The Art of Unix Programming", 2003.
  http://www.faqs.org/docs/artu/
  http://www.catb.org/~esr/writings/taoup/

[Ben86]: Jon Bentley, "Programming Pearls"; the chapter on implementing
  and prototyping the UNIX 'spell' program (read in the Polish
  translation).

--
Jakub Narebski
Poland
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html