On Mon, Apr 28, 2014 at 3:55 AM, Nguyễn Thái Ngọc Duy <pclouds@xxxxxxxxx> wrote:
> I hinted about it earlier [1]. It now passes the test suite and with a
> design that I'm happy with (thanks to Junio for a suggestion about the
> rename problem).
>
> From the user point of view, this reduces the writable size of the
> index down to the number of updated files. For example, my webkit
> index v4 is 14MB. With a fresh split, I only have to update an index
> of 200KB. Every file I touch will add about 80 bytes to that. As long
> as I don't touch every single tracked file in my worktree, I should
> not pay the penalty of writing a 14MB index file on every operation.

This is a very welcome type of improvement.

I am, however, concerned about the complexity of the format employed. Why do we need two EWAH bitmaps in the new index? Why isn't this just a pair of sorted files that are merge-joined at read time, with records in $GIT_DIR/index taking priority over same-named records in $GIT_DIR/sharedindex.$SHA1? Deletes could be marked with a bit or an "all zero" metadata record.

> The read penalty is not addressed here, so I still pay the 14MB
> hashing cost. But that's an easy problem. We could cache the validated
> index in a daemon. Whenever git needs to load an index, it pokes the
> daemon. The daemon verifies that the on-disk index still has the same
> signature, then sends the in-memory index to git. When git updates the
> index, it pokes the daemon again to update the in-memory index. Next
> time git reads the index, it does not have to pay the I/O cost any
> more (actually it does, but the cost is hidden away when you do not
> have to read it yet).

If we are going this far, maybe it is worthwhile building an mmap() region the daemon exports to the git client that holds the "in memory" format of the index. Clients would mmap this PROT_READ, MAP_PRIVATE and could then quickly access the base file information without doing further validation, or copying the large(ish) data over a pipe.
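To make the merge-join suggestion above concrete, here is a minimal sketch (not git's actual format or code; entry layout, SHA values, and the use of None as the "all zero" delete marker are all made up for illustration) of reading a small overlay file merged against a large shared index, with overlay records winning on name collisions:

```python
# Hypothetical sketch of the proposed merge-join read. Both inputs are
# sorted (path, sha) lists; records in the small overlay ("index") take
# priority over same-named records in the shared index. A sha of None
# stands in for the "all zero" delete marker suggested above.
def merge_join(shared, overlay):
    result = []
    i = j = 0
    while i < len(shared) or j < len(overlay):
        if j == len(overlay) or (i < len(shared)
                                 and shared[i][0] < overlay[j][0]):
            result.append(shared[i])        # only in shared index
            i += 1
        elif i == len(shared) or overlay[j][0] < shared[i][0]:
            if overlay[j][1] is not None:   # only in overlay; skip deletes
                result.append(overlay[j])
            j += 1
        else:                               # same path: overlay wins
            if overlay[j][1] is not None:
                result.append(overlay[j])
            i += 1
            j += 1
    return result

shared  = [("Makefile", "aaa"), ("t/t0000.sh", "bbb"), ("zlib.c", "ccc")]
overlay = [("Makefile", "ddd"), ("t/t0000.sh", None)]  # one update, one delete
print(merge_join(shared, overlay))
```

The point is that a single linear pass over two sorted files reconstructs the full index, so the writable file stays proportional to the number of touched entries without needing bitmaps to describe which shared entries are replaced or deleted.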
Junio had some other great ideas for improving the index on really large trees. Maybe I should let him comment, since they are really his ideas. Something about not even checking out most files, and storing most subtrees as just a "tree" entry in the index. E.g. if you are a bad developer and never touch the "t/" subdirectory, then that is stored as just "t" and the SHA-1 of the "t" tree, rather than the recursively exploded list of the test directory.
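As a toy illustration of that last idea (this is not git's index machinery; the entry tuples, paths, and SHA values are invented for the example), collapsing an untouched subtree would replace every blob entry under a prefix with one tree entry carrying the subtree's SHA-1:

```python
# Hypothetical sketch: replace all entries under prefix/ with a single
# "tree" entry recording the subtree's object id. Real git would have to
# re-explode this lazily when a path under the prefix is actually touched.
def collapse_subtree(entries, prefix, tree_sha):
    kept = [e for e in entries if not e[0].startswith(prefix + "/")]
    kept.append((prefix, tree_sha, "tree"))
    return sorted(kept)

index = [
    ("Makefile",   "1111", "blob"),
    ("t/t0000.sh", "2222", "blob"),
    ("t/t0001.sh", "3333", "blob"),
]
print(collapse_subtree(index, "t", "feed"))
```

The win is that an index entry count then scales with the number of directories you actually work in, not the total number of tracked files.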