50% speed-up of "git-apply && git-write-tree" sequence.

Junio C Hamano <junkio@xxxxxxx> · Sun, 23 Apr 2006 22:41:43 -0700

I've fixed-up the series and they are now sitting near the tip
of "pu" branch tonight.  With some more testing I am hoping to
merge it in "next" soon.

I should recap the overall idea for general public here.

When applying a series of patches to your working tree and
making commits out of the results, what happens is roughly this:

 1. Your index file exactly matches your HEAD commit; you may
    have local modifications to some paths in your working tree
    as long as they do not interfere with the patch application.

 2. "git-apply --index" reads the index file, reads the patch,
    applies them to both index and the working tree, and writes
    the updated index file out.

 3. "git-write-tree" reads the index file, writes the index
    entries out as a set of hierarchical tree objects, the top
    of which is wrapped in a new commit object that has the
    current HEAD as the parent, and the commit becomes the new
    HEAD.

You repeat 1-3 as many times as you apply patches.

When dealing with a huge tree, with a typical patch touching
only a handful paths, this is not very efficient.  A recent
kernel source tree has about 1200 directories, each of which
needs to be made into the "tree" object format in-core from the
data in the index file, and we hash it to see if we already have
that tree, and if not we write it out after compression (if we
do have it, we do not do anything after checking).

The updated code does this:

 (1) When git-write-tree writes out trees from the index, we
     store <directory, tree SHA1> pair for all the trees we
     compute, in .git/index.aux file.

 (2) When git-write-tree starts up, it reads the index file, and
     also reads the .git/index.aux file.  The latter file
     contains a fingerprint of the former, so if the index was
     updated without updating the index.aux, what is read is
     discarded.  However, if .git/index.aux is up-to-date, what
     it records can be reused when we scan the index.

 (3) There are a few commands that updates the index file.
     git-update-index is the most well-known one, but read-tree
     and apply also update it.

     No command other than apply updates index.aux when writing
     out index (yet).  However, git-apply has been told about
     it.  It reads index and index.aux file, and when it updates
     an index entry, it _invalidates_ the directory information
     it read from index.aux file, and writes the updated
     index.aux file out when it is done.

     Note.  Earlier I sent out a fix ([PATCH 5/4]) that
     contained a two-line removal from apply.c; the code was
     doing more than just invalidating entries, but actually
     recomputing the tree.  The fixed code does not hash tree
     information when git-apply writes index out.

 (4) When git-write-tree starts up after git-apply finishes its
     job, it notices index.aux file matches the fingerprint of
     the index (so it is up-to-date) and some entries have been
     _invalidated_.  When it writes out the index, the parts of
     the directory hierarchy that have not been invalidated do
     not have to be formatted in-core into "tree" -- we already
     know what tree object they are.  Only the parts that were
     invalidated need to be computed.

For this, a new data structure "cache_tree" is introduced.  The
way to use the API is very simple.  You initialize it like this:

	struct cache_tree *active_cache_tree;
	unsigned char sha1[20];
	read_cache_1(sha1);
        active_cache_tree = read_cache_tree(sha1);

The new function read_cache_1() is similar to the traditional
read_cache(), but it gives back the index file fingerprint.
This is given to read_cache_tree() to make sure index.aux is not
stale.  Then, when you touch a path in the cache, you invalidate
the paths in the cache_tree with:

	cache_tree_invalidate_path(active_cache_tree, path);

For example, if you are updating an index entry
"fs/ext3/inode.c", it invalidates the cached tree SHA1
information for top-level tree, "fs" tree and "fs/ext3" tree.
All other information you read from index.aux are still valid
(IOW, three out of 1200 are invalidated).

When you are done, you write out the index file and the updated
index.aux file like this:

        if (write_cache_1(newfd, active_cache, active_nr, active_cache_sha1) ||
            commit_index_file(&cache_file))
                die("Unable to write new cachefile");
        write_cache_tree(active_cache_sha1, active_cache_tree);

Again, write_cache_1() has one extra parameter to receive the
updated index file fingerprint, and you call write_cache_tree()
with it.  It updates index.aux file (with invalidated entries).

One thing write-tree does different is this (note that it does
not update the index file, it just reads from it):

	if (cache_tree_update(active_cache_tree, active_cache, active_nr,
			      missing_ok))
		die("git-write-tree: error building trees");
	write_cache_tree(active_cache_sha1, active_cache_tree);

The call to cache_tree_update() takes the (possibly partially
invalidated) cache_tree, fills in the invalidated parts from the
index file (this involves creation of new "tree" objects as
needed).  So the above call to write_cache_tree() writes a
complete index.aux file out without any invalidated entries.

Obviously, the change similar to what I did to apply.c in this
series can be done to other index-updating commands.  That's
"low hanging fruit" for the week ;-).

-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html