Re: Store refreshed stat info in a separate file?

Duy Nguyen <pclouds@xxxxxxxxx> · Fri, 25 Apr 2014 12:18:08 +0700

On Sat, Apr 19, 2014 at 12:43 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Having said that, I do not think there is a fundamental reason why
> the stat data has to live inside the same index file.  A separate
> file is just fine, as long as you can reliably detect that they went
> out of sync for whatever reason (e.g. "the index proper updated, a
> stale stat file left behind"), and storing the trailer checksum from
> the corresponding index in this new file is an obvious and good
> solution.
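
A minimal sketch of that check, in Python for brevity. It relies on the
index file ending in a 20-byte SHA-1 trailer; the layout of the separate
stat file (trailer first, then stat data) and all function names are
assumptions made for illustration, not an existing format.

```python
import hashlib

TRAILER_LEN = 20  # the index file ends with a SHA-1 over its contents


def make_index(payload: bytes) -> bytes:
    """Append a SHA-1 trailer over the payload, as the index file does."""
    return payload + hashlib.sha1(payload).digest()


def index_trailer(index_bytes: bytes) -> bytes:
    """Return the trailing checksum of an index file's contents."""
    return index_bytes[-TRAILER_LEN:]


def make_stat_file(index_bytes: bytes, stat_payload: bytes) -> bytes:
    """Hypothetical layout: the owning index's trailer, then stat data."""
    return index_trailer(index_bytes) + stat_payload


def stat_file_is_current(index_bytes: bytes, stat_bytes: bytes) -> bool:
    """True only if the stat file was written for this exact index."""
    return stat_bytes[:TRAILER_LEN] == index_trailer(index_bytes)
```

If the index proper is rewritten and the old stat file is left behind,
the stored trailer no longer matches and the stat file can be ignored.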

I've gone further and now store index updates (including entry removals
and additions) in the second index file, so that index I/O cost is
roughly proportional to the number of changed entries, not the work
tree size. That makes it scale much better when the work tree is
huge. There is one flaw, though, and I'm expecting many "yuck"
responses from people. So let's try to settle it now, or drop the idea.

The idea is that we can support another mode, where index content is
stored in two files, the small $GIT_DIR/index and the large
$GIT_DIR/index.base. "index" contains changes that should be applied
to "index.base". Whenever you do something to the index, "index"
records those actions. Git reads both index.base and index, then
replays the actions to produce the final index in memory. "index.base"
contains full worktree data and remains unchanged until "index"
becomes so big/slow that the changes should be merged back into
"index.base". This works great (my prototype passed the test suite),
and even better than index v5 because v5 still rewrites the whole
index file when an entry is added or removed.
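
To make the replay step concrete, here is a rough sketch in Python. It
is not the prototype's code: the action names, the in-memory
representation (a dict from path to an opaque entry string) and the
merge-back threshold are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# path -> opaque entry data (mode, object id, stat info in a real index)
Index = Dict[str, str]
# (action, path, entry); entry is None for removals
Action = Tuple[str, str, Optional[str]]


def replay(base: Index, actions: List[Action]) -> Index:
    """Apply the recorded actions, in order, to a copy of the base."""
    result = dict(base)
    for action, path, entry in actions:
        if action == "remove":
            result.pop(path, None)
        elif action in ("add", "update"):
            if entry is None:
                raise ValueError(f"{action} needs an entry for {path}")
            result[path] = entry
        else:
            raise ValueError(f"unknown action: {action}")
    return result


def should_merge_back(base: Index, actions: List[Action],
                      ratio: float = 0.2) -> bool:
    """Hypothetical policy: fold the log into the base once it is large."""
    return len(actions) > ratio * max(len(base), 1)
```

Reading costs one pass over the base plus one pass over the log, while
writing only appends to or rewrites the small log until
should_merge_back() says it is time to rewrite the base.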

But there is a problem with atomic updates. The good old rename() does
not work well with two files. This is not a problem for the C part; I
can still make atomic updates work there. Scripts, on the other hand,
may rely on "mv" or similar commands/functions to prepare a temp index
and move it to $GIT_DIR/index. The workaround is to merge the two
files back into a single index file so that scripts can
"mv $temp_index" as before, and pay the whole-index I/O penalty. An
alternative is to store the two files in one, so that the one index
file actually consists of two subfiles. We avoid the atomic update
problem, but we pay the I/O cost of writing 10MB every time the index
is updated (though not of hashing a 10MB file) and introduce a new
index format. This is even yuckier in my opinion.
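
For illustration, the "two subfiles in one file" alternative could look
roughly like the Python sketch below. The container layout (an 8-byte
length header followed by the base and then the changes) and the
function names are invented for this example; only the
write-to-temp-then-rename pattern is the usual one.

```python
import os
import struct
import tempfile

HEADER = struct.Struct(">Q")  # hypothetical: 8-byte big-endian base length


def write_combined(path: str, base: bytes, delta: bytes) -> None:
    """Write both subfiles to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(HEADER.pack(len(base)))
            f.write(base)   # rewritten in full on every update
            f.write(delta)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # a single rename, atomic on POSIX
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def read_combined(path: str):
    """Return (base, delta) as stored by write_combined()."""
    with open(path, "rb") as f:
        data = f.read()
    (base_len,) = HEADER.unpack_from(data)
    start = HEADER.size
    return data[start:start + base_len], data[start + base_len:]
```

Because there is only one file again, a script's "mv" keeps working,
but every update rewrites the unchanged base bytes along with the
changes.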

Should I continue, or drop it?
-- 
Duy