Clarifications on the "faster index format" project

Thomas Rast <trast@xxxxxxxxxxx> · Wed, 28 Mar 2012 02:36:49 +0200

Dear students,

The proposal "Designing a faster index format" has attracted quite a bit
more attention that we expected.  We would like to emphasize that it is
not an easy project.

I have myself only found out about most of the points I am listing here
through interaction with students.  My apologies for this.  Take them as
a (non-exhaustive!) list of concerns.  The more of them you can address
in your proposal, the better.

Issues with the original project idea:

- Further timings have shown that even on humongous indexes such as
  Webkit's (roughly 25MB), git-add is still pretty fast.  It is a good
  baseline for what index-changing commands need to do however.  Judge
  from your own timings and state what case you want to optimize.

- The notation is confusing.  Asymptotics noted in the proposal should
  distinguish the variables "# index entries" (n), "# changed entries",
  and depending on the application also "size of index file" (though
  this is fairly tightly coupled with n) or other variables.

Some hardnesses that we expect:

- Without prior knowledge of the index, it is hard to see how a new
  format must be designed to stay within the requirements.

- The index format is closely tied to a lot of core git's code.  The
  main work of git-ls-files, git-status, git-read-tree, git-add,
  git-diff etc. all directly accesses the index memory structure.  This
  means that changing the in-memory structure is a lot of work.

- Changing the in-memory structure also makes conversion between the old
  and new format more difficult.

- Writing out only the changed entries would save a lot of time.
  However there are many code paths that change/add/remove index
  entries, and they must all record what to write in some way.

- Cutting down the amount of verification going on is very tempting, but
  needs careful design to keep the chances of a bit error propagating
  all the way into the repository small.

There are many tradeoffs to be made, which must be evaluated carefully:

- If the current lock-rewrite-rename scheme remains in place, any change
  only affects the in-core work done.  If the lock scheme is changed,
  there are many different cases of corruption/partial writes that must
  be handled.

- There may be ways to reduce the data written (and thus checksummed) at
  the cost of extra work.  Similarly, a more complex data structure may
  or may not pay off depending on the extra space taken (or saved) and
  extra bookkeeping done.

- Some improvements would be possible by using techniques from database
  libraries.  However, this either means an external dependency or a lot
  of extra work.  It may also come with extra startup costs.

- Using database libraries may be a deal-breaker for git's own
  portability or the other readers (libgit2 and jgit, mainly).

The format should cope with requirements that are not clearly specified
at this time.  For example:

- The existing code also assumes that iterating over an index is not a
  problem.  For possible future work, the format should allow for
  iterating over only a select part of the index (known, e.g., from an
  inotify daemon or a pathspec) quickly.

Finally I have a request not related to project hardness: please Cc the
proposed mentors for discussions on the respective projects :-)

Thanks for reading,

Thomas

-- 
Thomas Rast
trast@{inf,student}.ethz.ch
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html