Understanding Git Under The Hood: Trees

Erik Bernoth <erik.bernoth@xxxxxxxxx> · Thu, 15 Aug 2013 12:29:13 +0200

Hi,

I'm currently trying to understand the inner workings of git better by
writing a git clone in Python. I find it rather hard to understand how
to efficiently use trees.

What I understand is this: Trees are in essence blobs with a specific
content. The content is a relation between other blobs names, types
and modes. With these lines they can refer to other blobs and trees
and thus make a tine filesystem.

Now I constructed a system to read and write blobs and have an index
file that potentially references to a top tree object, which
represents the currently cached repository state. I can add and remove
files from the index manually, by creating the blob of the file and
working my way up adding or updating trees until I hit the one
referenced in INDEX. My algorithm to automate it is really ugly, error
prone and inefficient, though. I also have a hard time to find my way
around in C files, so maybe some people here in the list could explain
the algorithm in Git to me.

suppose we have the following Index:

INDEX
 -> tree
       -> tree "a/"
           -> blob "b.txt"
       -> tree "c/"
           -> blob "d.txt"

Now you want to stage a file with the following reference "a/e/g.txt".

One approach would be to walk top-down, splitting the path into its
elements and looking for the corresponding trees, either retrieving an
existing tree or creating a new one. Then finally create the blob
"g.txt" and be done with it. This seems rather inefficient, though,
because each created or updated tree means that all trees way back up
need to be updated as well, once for every step in the loop.

The other way is to go bottom-up, first creating the blob, then
creating trees up to the project root folder. But then I don't see a
way to find which tree elements already exist and need to be updated.

So the only algorithm I can come up with is this:
 1. walk down the tree with help of the path string to the tree that
is closest to the file I want to store. On the way remember all the
trees on the path from INDEX to the resulting file. (In the example
above I'd like to get the hash of the "a/" tree)
 2. create the blob (in the example with the context of g.txt)
 3. create the trees bottom-up until one step before the tree found in
1. (in the example create a "e/" tree, containing the "g.txt"'s blob)
 4. Add the resulting tree from 3. to the one found in 1. and create
the updated hash
 5. Now with help of the list from 1. walk the existing trees
bottom-up and update each one with the new hashes until INDEX is hit.
 6. Update INDEX.

Alltogether the idea of trees looked really simple and recursive which
makes me quite unhappy with the algorithm I came up with.

What is the algorithm to stage single files in Git itself?
How

Also: I thought to myself, why not just make a function that returns
the relative path to the repository root folder and consider that
string the file name, drop the idea of trees at all and add the
information that is traditionally stored in tree objects directly in a
commit object. Wouldn't that be much simpler and still accomplish the
same? I think the idea of keeping information in separate small files
instead of single big files was dropped at one point anyway, when
pack-files were introduced.

Cheers
Erik
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html