Re: inotify daemon speedup for git [POC/HACK]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Jul 27, 2010, at 5:18 PM, Avery Pennarun wrote:

On Tue, Jul 27, 2010 at 8:00 PM, Shawn O. Pearce <spearce@xxxxxxxxxxx> wrote:
Avery Pennarun <apenwarr@xxxxxxxxx> wrote:
While we're here, it's probably worth mentioning that git's index file
format (which stores a sequential list of full paths in alphabetical
order, instead of an actual hierarchy) does become a bottleneck when
you actually have a huge number of files in your repo (like literally a million). You can't actually binary search through the index! The
current implementation of submodules allows you to dodge that
scalability problem since you end up with multiple smaller index
files.  Anyway, that's fixable too.

Yes.

More than once I've been tempted to rewrite the on-disk (and I guess
in-memory) format of the index.  And then I remember how painful that
stuff is in either C git.git or JGit, and I back away slowly.  :-)

Ideally the index is organized the same way the trees are, but
you still can't do a really good binary search because of the
ass-backwards name sorting rule for trees.  But for performance
reasons you still want to keep the entire index in a single file,
an index per directory (aka SVN/CVS) is too slow for the common
case of <30k files.

Really?  What's wrong with the name sorting rule?  I kind of like it.

bup's current index - after I abandoned my clone of the git one since
it was too slow with insane numbers of files - is very fast for reads
and in-place updates using mmap.

Essentially, it's a tree, starting from the outermost leafs and
leading toward the entry at the very end of the file, which is the
root.  (The idea of doing it backwards was that I could write the file
sequentially.  In retrospect, that was probably an unnecessarily
brain-bending waste of time and the root should have been the first
entry instead.)

For speed, the bup index can just mark entries as deleted using a flag
rather than actually rewriting the whole indexfile.  Unfortunately, I
failed to make it sufficiently flexible to *add* new entries without
needing to rewrite the whole thing.  In bup, that's a big deal
(especially since python is kind of slow and there are typically >1
million files in the index).  In git, it's maybe not so bad; after
all, the current implementation rewrites the index *every* time and
nobody notices.

Okay, I have an idea. If I understand correctly, the index is a flat database of records including a pathname and several fixed-length fields. Since the records are not fixed-length, only sequential search is possible, even though the records are sorted by pathname.

Here's the idea: Divide the database into blocks. Each block contains a block header and the records belonging to a single directory. The block header contains the length of the block and also the offset to the next block, in bytes. In addition to a record for each indexed file in a directory, a directory's block also contains records for subdirectories. The mode flags in a record indicate the record type. Directory records contain an offset in bytes to the block for that directory (in place of the SHA-1 hash). The block list is preceded by a file header, which includes the offset in bytes of the root block. All offsets are from the beginning of the file.

Instead of having to search among every file in the repository, the search space now includes only the immediate descendants of each directory in the target file's path. If a directory is modified then it can either be rewritten in place (if there's sufficient room) or appended to the end of the file (requiring the old and new sequentially preceding blocks and the parent directory's block to update their offsets).

Is this useful?

Josh


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]