Dear students, The proposal "Designing a faster index format" has attracted quite a bit more attention that we expected. We would like to emphasize that it is not an easy project. I have myself only found out about most of the points I am listing here through interaction with students. My apologies for this. Take them as a (non-exhaustive!) list of concerns. The more of them you can address in your proposal, the better. Issues with the original project idea: - Further timings have shown that even on humongous indexes such as Webkit's (roughly 25MB), git-add is still pretty fast. It is a good baseline for what index-changing commands need to do however. Judge from your own timings and state what case you want to optimize. - The notation is confusing. Asymptotics noted in the proposal should distinguish the variables "# index entries" (n), "# changed entries", and depending on the application also "size of index file" (though this is fairly tightly coupled with n) or other variables. Some hardnesses that we expect: - Without prior knowledge of the index, it is hard to see how a new format must be designed to stay within the requirements. - The index format is closely tied to a lot of core git's code. The main work of git-ls-files, git-status, git-read-tree, git-add, git-diff etc. all directly accesses the index memory structure. This means that changing the in-memory structure is a lot of work. - Changing the in-memory structure also makes conversion between the old and new format more difficult. - Writing out only the changed entries would save a lot of time. However there are many code paths that change/add/remove index entries, and they must all record what to write in some way. - Cutting down the amount of verification going on is very tempting, but needs careful design to keep the chances of a bit error propagating all the way into the repository small. There are many tradeoffs to be made, which must be evaluated carefully: - If the current lock-rewrite-rename scheme remains in place, any change only affects the in-core work done. If the lock scheme is changed, there are many different cases of corruption/partial writes that must be handled. - There may be ways to reduce the data written (and thus checksummed) at the cost of extra work. Similarly, a more complex data structure may or may not pay off depending on the extra space taken (or saved) and extra bookkeeping done. - Some improvements would be possible by using techniques from database libraries. However, this either means an external dependency or a lot of extra work. It may also come with extra startup costs. - Using database libraries may be a deal-breaker for git's own portability or the other readers (libgit2 and jgit, mainly). The format should cope with requirements that are not clearly specified at this time. For example: - The existing code also assumes that iterating over an index is not a problem. For possible future work, the format should allow for iterating over only a select part of the index (known, e.g., from an inotify daemon or a pathspec) quickly. Finally I have a request not related to project hardness: please Cc the proposed mentors for discussions on the respective projects :-) Thanks for reading, Thomas -- Thomas Rast trast@{inf,student}.ethz.ch -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html