On Mon, Apr 28, 2014 at 3:55 AM, Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> wrote:
> I hinted about it earlier [1]. It now passes the test suite and with a
> design that I'm happy with (thanks to Junio for a suggestion about the
> rename problem).
>
> From the user point of view, this reduces the writable size of the
> index down to the number of updated files. For example, my webkit
> index v4 is 14MB. With a fresh split, I only have to update an index
> of 200KB. Every file I touch will add about 80 bytes to that. As long
> as I don't touch every single tracked file in my worktree, I should
> not pay the penalty of writing a 14MB index file on every operation.

This is a very welcome type of improvement.

I am, however, concerned about the complexity of the format employed. Why do we need two EWAH bitmaps in the new index? Why isn't this just a pair of sorted files that are merge-joined at read time, with records in $GIT_DIR/index taking priority over same-named records in $GIT_DIR/sharedindex.$SHA1? Deletes could be marked with a bit or an "all zero" metadata record.

> The read penalty is not addressed here, so I still pay the 14MB
> hashing cost. But that's an easy problem. We could cache the validated
> index in a daemon. Whenever git needs to load an index, it pokes the
> daemon. The daemon verifies that the on-disk index still has the same
> signature, then sends the in-memory index to git. When git updates the
> index, it pokes the daemon again to update the in-memory index. Next
> time git reads the index, it does not have to pay the I/O cost any
> more (actually it does, but the cost is hidden away when you do not
> have to read it yet).

If we are going this far, maybe it is worthwhile building an mmap() region the daemon exports to the git client that holds the "in memory" format of the index. Clients would mmap this PROT_READ, MAP_PRIVATE and could then quickly access the base file information without doing further validation, or copying the large(ish) data over a pipe.
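To make the merge-join suggestion above concrete, here is a minimal sketch (not git's actual format or code; entry layout, SHA values, and the use of None as the "all zero" delete marker are all made up for illustration) of reading a small overlay file merged against a large shared index, with overlay records winning on name collisions:

```python
# Hypothetical sketch of the proposed merge-join read. Both inputs are
# sorted (path, sha) lists; records in the small overlay ("index") take
# priority over same-named records in the shared index. A sha of None
# stands in for the "all zero" delete marker suggested above.
def merge_join(shared, overlay):
    result = []
    i = j = 0
    while i < len(shared) or j < len(overlay):
        if j == len(overlay) or (i < len(shared)
                                 and shared[i][0] < overlay[j][0]):
            result.append(shared[i])        # only in shared index
            i += 1
        elif i == len(shared) or overlay[j][0] < shared[i][0]:
            if overlay[j][1] is not None:   # only in overlay; skip deletes
                result.append(overlay[j])
            j += 1
        else:                               # same path: overlay wins
            if overlay[j][1] is not None:
                result.append(overlay[j])
            i += 1
            j += 1
    return result

shared  = [("Makefile", "aaa"), ("t/t0000.sh", "bbb"), ("zlib.c", "ccc")]
overlay = [("Makefile", "ddd"), ("t/t0000.sh", None)]  # one update, one delete
print(merge_join(shared, overlay))
```

The point is that a single linear pass over two sorted files reconstructs the full index, so the writable file stays proportional to the number of touched entries without needing bitmaps to describe which shared entries are replaced or deleted.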
Junio had some other great ideas for improving the index on really large trees. Maybe I should let him comment, since they are really his ideas. Something about not even checking out most files, and storing most subtrees as just a "tree" entry in the index. E.g. if you are a bad developer and never touch the "t/" subdirectory, then that is stored as just "t" and the SHA-1 of the "t" tree, rather than the recursively exploded list of the test directory.
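As a toy illustration of that last idea (this is not git's index machinery; the entry tuples, paths, and SHA values are invented for the example), collapsing an untouched subtree would replace every blob entry under a prefix with one tree entry carrying the subtree's SHA-1:

```python
# Hypothetical sketch: replace all entries under prefix/ with a single
# "tree" entry recording the subtree's object id. Real git would have to
# re-explode this lazily when a path under the prefix is actually touched.
def collapse_subtree(entries, prefix, tree_sha):
    kept = [e for e in entries if not e[0].startswith(prefix + "/")]
    kept.append((prefix, tree_sha, "tree"))
    return sorted(kept)

index = [
    ("Makefile",   "1111", "blob"),
    ("t/t0000.sh", "2222", "blob"),
    ("t/t0001.sh", "3333", "blob"),
]
print(collapse_subtree(index, "t", "feed"))
```

The win is that an index entry count then scales with the number of directories you actually work in, not the total number of tracked files.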