Re: Understanding Git Under The Hood: Trees

Andreas Ericsson <ae@xxxxxx> · Thu, 15 Aug 2013 14:40:51 +0200

On 2013-08-15 12:29, Erik Bernoth wrote:
Hi,

I'm currently trying to understand the inner workings of git better by
writing a git clone in Python. I find it rather hard to understand how
to efficiently use trees.

What I understand is this: Trees are in essence blobs with a specific
content. The content is a relation between other blobs names, types
and modes. With these lines they can refer to other blobs and trees
and thus make a tine filesystem.

Now I constructed a system to read and write blobs and have an index
file that potentially references to a top tree object, which
represents the currently cached repository state. I can add and remove
files from the index manually, by creating the blob of the file and
working my way up adding or updating trees until I hit the one
referenced in INDEX. My algorithm to automate it is really ugly, error
prone and inefficient, though. I also have a hard time to find my way
around in C files, so maybe some people here in the list could explain
the algorithm in Git to me.

suppose we have the following Index:

INDEX
  -> tree
        -> tree "a/"
            -> blob "b.txt"
        -> tree "c/"
            -> blob "d.txt"

Now you want to stage a file with the following reference "a/e/g.txt".

One approach would be to walk top-down, splitting the path into its
elements and looking for the corresponding trees, either retrieving an
existing tree or creating a new one. Then finally create the blob
"g.txt" and be done with it. This seems rather inefficient, though,
because each created or updated tree means that all trees way back up
need to be updated as well, once for every step in the loop.

The other way is to go bottom-up, first creating the blob, then
creating trees up to the project root folder. But then I don't see a
way to find which tree elements already exist and need to be updated.

So the only algorithm I can come up with is this:
  1. walk down the tree with help of the path string to the tree that
is closest to the file I want to store. On the way remember all the
trees on the path from INDEX to the resulting file. (In the example
above I'd like to get the hash of the "a/" tree)
  2. create the blob (in the example with the context of g.txt)
  3. create the trees bottom-up until one step before the tree found in
1. (in the example create a "e/" tree, containing the "g.txt"'s blob)
  4. Add the resulting tree from 3. to the one found in 1. and create
the updated hash
  5. Now with help of the list from 1. walk the existing trees
bottom-up and update each one with the new hashes until INDEX is hit.
  6. Update INDEX.

Alltogether the idea of trees looked really simple and recursive which
makes me quite unhappy with the algorithm I came up with.

What is the algorithm to stage single files in Git itself?

You seem to believe that the in-memory representation of trees have to
be the same as the on-disk one. That's simply not true. Git cheats
outrageously with internal formats for pretty much everything in order
to squeeze out more performance.

You also seem to believe that a tree is more than one directory.
That's not true either.

So... Each tree that's updated can (and does) have an in-memory link
to its parent directory. Whenever we update a directory and create a
commit from it, we make sure to write them out from right to left (ie, 
children before parents), so that we're never in a state where a tree on 
disk can't find any of its content blobs in the same object storage.

With that information in hand, I'm sure you can quite easily create an
algorithm that turns out a bit prettier than what you have now.

Also: I thought to myself, why not just make a function that returns
the relative path to the repository root folder and consider that
string the file name, drop the idea of trees at all and add the
information that is traditionally stored in tree objects directly in a
commit object.

Because commit objects must have parents and things like that. Also,
the idea of trees containing trees saves a *huge* amount of disk I/O
and space when updating a project with many subdirectories.

Wouldn't that be much simpler and still accomplish the
same?

Simpler; Yes. More efficient; No.

The commit that changed the tree structure from flat to hierarchical
dates back to April of 2005. It's commit number 25 in the history of
git and solves one of the first real problems from live-testing git
on the kernel repository, which was that tree-files got so huge when
they had to be rewritten in their entirety that creating a new commit
took several seconds.

I think the idea of keeping information in separate small files
instead of single big files was dropped at one point anyway, when
pack-files were introduced.

Not really. We strive hard to minimize disk I/O whenever we can. The
packfiles actually reduce I/O, since it lets the kernel use the disk's
builtin caches a lot better, and read-ahead works to our advantage.

Btw, when I say "minimize disk I/O", I really mean "minimize user wait",
although disk I/O is certainly (often) the largest part of that.

--
Andreas Ericsson                   andreas.ericsson@xxxxxx
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html