Re: Curious about details of optimization of object database...

"Boyd Stephen Smith Jr." <bss@xxxxxxxxxxxxxxxxx> · Fri, 9 Jan 2009 13:07:21 -0600

On Friday 2009 January 09 11:46:23 chris@xxxxxxxxxxxx wrote:
>I'm told a commit is *not* a patch (diff), but, rather a copy of the entire
>tree.

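It's even more than that, though the commit object itself is small.  A commit
object contains its message, the author and committer, the SHA of the tree
(which is the snapshot of the whole tree), and zero or more SHAs for its
parents.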
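
To make that concrete, here is a rough Python sketch of how such an object is
laid out and named.  It assumes the usual "<type> <size>\0<body>" framing
hashed with SHA-1.  The helper names (git_object_id, commit_body) and the
identity strings are made up for illustration and are not part of git.

```python
import hashlib


def git_object_id(obj_type, body):
    """Return the SHA-1 of b'<type> <size>\\0' + body, as a hex string."""
    header = f"{obj_type} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree, parents, author, committer, message):
    """Build the text of a commit: headers, a blank line, then the message."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero or more parents
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode()


# Hypothetical identity and timestamp, for illustration only.
who = "A U Thor <author@example.com> 1231528041 -0600"

root = commit_body(
    tree=git_object_id("tree", b""),  # id of an empty tree
    parents=[],                       # a root commit has no parents
    author=who,
    committer=who,
    message="Initial commit\n",
)
print(git_object_id("commit", root))
```

Note that the commit only names a tree by its SHA; the file contents live in
separate blob objects, so an unchanged file is shared between commits.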

>Can anyone say, in a few sentences, how git avoids needing to keep multiple
>slightly different copies of entire files without just storing lots of
>patches/diffs?

Loose objects can have large swaths of duplicated data.  However, git also 
supports storing objects in a packed format, which uses delta compression to 
reduce the duplication to close to nothing.
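
The idea behind delta compression can be sketched in a few lines of Python.
This is only an illustration: it uses difflib from the standard library as a
stand-in for git's own delta search, and the copy/insert tuples are a
simplified, hypothetical encoding rather than the real pack format.

```python
from difflib import SequenceMatcher


def make_delta(base, target):
    """Describe target as copies out of base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))       # offset and length in base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))   # bytes not found in base
        # a pure deletion needs no op: those base bytes are just not copied
    return ops


def apply_delta(base, ops):
    """Rebuild the target from the base object and the delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(ops):
    """Count the bytes the delta has to store verbatim."""
    return sum(len(op[1]) for op in ops if op[0] == "insert")


# Two versions of a small file that differ in a single line.
old = "".join(f"line {i}: value {i * i}\n" for i in range(60)).encode()
new = old.replace(b"line 30:", b"line thirty:")

delta = make_delta(old, new)
assert apply_delta(old, delta) == new
print(len(new), "bytes in the new version;", literal_bytes(delta), "stored literally")
```

As loose objects, both versions would be stored in full.  In a pack, one
version can be stored as a delta against the other, so only the changed bytes
and a few copy instructions take up space.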

Some examples:
Sizes are from "du -sh .git .".  The .git directory stores all the objects as
well as the repository configuration, refs, reflogs, etc.  The . directory
has .git and a clean checkout of master, so its size includes .git.

The LinuxPMI (http://linuxpmi.org/) tree:
41M     .git
83M     .
(So, the storage is actually a bit smaller than the checkout; 984 objects; 140 
commits)

A small project between me and my flatmates:
309K    .git
3.6M    .
(Here, the storage is significantly smaller than the checkout; 786 objects; 
155 commits)

My repository that tracks my dotfiles:
124K    .git
176K    .
(113 objects; 28 commits)
-- 
Boyd Stephen Smith Jr.                     ,= ,-_-. =. 
bss@xxxxxxxxxxxxxxxxx                     ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy           `-'(. .)`-' 
http://iguanasuicide.net/                      \_/     