On 8/23/2018 1:31 PM, Stefan Beller wrote:
On Thu, Aug 23, 2018 at 8:45 AM Ben Peart <Ben.Peart@xxxxxxxxxxxxx> wrote:
This patch helps address the CPU cost of loading the index by creating
multiple threads to divide the work of loading and converting the cache
entries across all available CPU cores.
It accomplishes this by having the primary thread loop across the index file,
tracking the offset and (for V4 indexes) expanding the name, and kicking off a
thread to process each block of entries as it reaches it. Once those threads
have completed and the cache entries are loaded, the extensions can be loaded
and processed normally on the primary thread.
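(To illustrate the shape of this scheme, here is a rough, self-contained
sketch; the struct layouts, block size, and helper names below are made up
for illustration and are not the patch's own. In the real patch, the primary
thread's walk is what computes each block's starting offset and expands the
V4 names before a worker is started.)

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NR_ENTRIES 100000
#define ENTRIES_PER_THREAD 7500
#define NR_BLOCKS ((NR_ENTRIES + ENTRIES_PER_THREAD - 1) / ENTRIES_PER_THREAD)

struct ondisk_entry { int data; };      /* stand-in for the on-disk format */
struct cache_entry  { int converted; }; /* stand-in for the in-memory form */

struct load_block {
	const struct ondisk_entry *src;
	struct cache_entry *dst;
	int nr;
};

/* worker: convert one block of on-disk entries to cache entries */
static void *load_block_thread(void *arg)
{
	struct load_block *b = arg;
	int i;

	for (i = 0; i < b->nr; i++)
		b->dst[i].converted = b->src[i].data;
	return NULL;
}

int main(void)
{
	struct ondisk_entry *ondisk = calloc(NR_ENTRIES, sizeof(*ondisk));
	struct cache_entry *cache = calloc(NR_ENTRIES, sizeof(*cache));
	pthread_t threads[NR_BLOCKS];
	struct load_block blocks[NR_BLOCKS];
	int i, nr = 0;

	/*
	 * Primary thread: walk the entries and kick off a worker for
	 * each block.  (In the patch this walk is also what finds each
	 * entry's offset and expands V4 prefix-compressed names.)
	 */
	for (i = 0; i < NR_ENTRIES; i += ENTRIES_PER_THREAD) {
		struct load_block *b = &blocks[nr];

		b->src = ondisk + i;
		b->dst = cache + i;
		b->nr = (NR_ENTRIES - i < ENTRIES_PER_THREAD) ?
			NR_ENTRIES - i : ENTRIES_PER_THREAD;
		pthread_create(&threads[nr], NULL, load_block_thread, b);
		nr++;
	}

	/* once all workers finish, extensions would be processed here */
	for (i = 0; i < nr; i++)
		pthread_join(threads[i], NULL);

	printf("loaded %d entries in %d blocks\n", NR_ENTRIES, nr);
	free(ondisk);
	free(cache);
	return 0;
}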
Performance impact:

read cache .git/index times on a synthetic repo
(FALSE = parallel loading disabled, TRUE = enabled):

100,000 entries
    FALSE         TRUE          Savings       %Savings
    0.014798767   0.009580433   0.005218333   35.26%

1,000,000 entries
    FALSE         TRUE          Savings       %Savings
    0.240896533   0.1751243     0.065772233   27.30%

read cache .git/index times on an actual repo with ~3M entries:

    FALSE         TRUE          Savings       %Savings
    0.59898098    0.4513169     0.14766408    24.65%
Signed-off-by: Ben Peart <Ben.Peart@xxxxxxxxxxxxx>
---
Notes:
Base Ref: master
Web-Diff: https://github.com/benpeart/git/commit/67a700419b
Checkout: git fetch https://github.com/benpeart/git read-index-multithread-v1 && git checkout 67a700419b
Documentation/config.txt | 8 ++
config.c | 13 +++
config.h | 1 +
read-cache.c | 218 ++++++++++++++++++++++++++++++++++-----
4 files changed, 216 insertions(+), 24 deletions(-)
diff --git a/Documentation/config.txt b/Documentation/config.txt
index 1c42364988..3344685cc4 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -899,6 +899,14 @@ relatively high IO latencies. When enabled, Git will do the
index comparison to the filesystem data in parallel, allowing
overlapping IO's. Defaults to true.
+core.fastIndex::
+ Enable parallel index loading
++
+This can speed up operations like 'git diff' and 'git status' especially
+when the index is very large. When enabled, Git will do the index
+loading from the on disk format to the in-memory format in parallel.
+Defaults to true.
"fast" is a non-descriptive word as we try to be fast in any operation?
Maybe core.parallelIndexReading as that just describes what it
turns on/off, without second guessing its effects?
(Are there still computers with just a single CPU, where this would not
make it faster? ;-))
How about core.parallelReadIndex? Slightly shorter and matches the
function names better.
+int git_config_get_fast_index(void)
+{
+ int val;
+
+ if (!git_config_get_maybe_bool("core.fastindex", &val))
+ return val;
+
+ if (getenv("GIT_FASTINDEX_TEST"))
+ return 1;
We look at this env value just before calling this function;
can we rewrite it so that it only looks at the env variable once?
Sure, I didn't like the fact that it was called twice but didn't get
around to cleaning it up.
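For what it's worth, one possible shape for that cleanup (just a sketch; the
"-1 means unset" sentinel and the ordering are assumptions, not the patch's
final form) is to finish the accessor so it is the only place that consults
the environment:

int git_config_get_fast_index(void)
{
	int val;

	if (!git_config_get_maybe_bool("core.fastindex", &val))
		return val;

	if (getenv("GIT_FASTINDEX_TEST"))
		return 1;

	return -1;	/* unset: caller applies the documented default (true) */
}

The caller would then drop its own getenv("GIT_FASTINDEX_TEST") check and
rely solely on this accessor.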
+++ b/config.h
@@ -250,6 +250,7 @@ extern int git_config_get_untracked_cache(void);
extern int git_config_get_split_index(void);
extern int git_config_get_max_percent_split_change(void);
extern int git_config_get_fsmonitor(void);
+extern int git_config_get_fast_index(void);
Oh. nd/no-extern did not cover config.h
+#ifndef min
+#define min(a,b) (((a) < (b)) ? (a) : (b))
+#endif
We do not have a minimum function in the tree,
except for xdiff/xmacros.h:29: XDL_MIN.
I wonder what the rationale is for not having a MIN()
definition; I think we discussed that on the list a couple
of times, but the rationale escaped me.
If we introduce a min/max macro, can we put it somewhere
more prominent? (I would find it useful elsewhere)
I'll avoid that particular rabbit hole and just remove the min macro
definition. ;-)
+/*
+ * Mostly randomly chosen maximum thread counts: we
+ * cap the parallelism to online_cpus() threads, and we want
+ * to have at least 7500 cache entries per thread for it to
+ * be worth starting a thread.
+ */
+#define THREAD_COST (7500)
This reads very similar to preload-index.c THREAD_COST
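(For readers comparing with preload-index.c: the way such a cost constant
typically gets applied is roughly the following; the function name and exact
fallback here are made up for illustration, not taken from the patch.
online_cpus() is the existing helper from thread-utils.h.)

/*
 * Sketch only: derive a thread count from the number of cache
 * entries using THREAD_COST, capped at online_cpus(), and fall
 * back to a single thread when there is too little work to
 * amortize the thread start-up cost.
 */
static int index_load_threads(unsigned int nr_entries)
{
	int nr_threads = nr_entries / THREAD_COST;

	if (nr_threads > online_cpus())
		nr_threads = online_cpus();
	if (nr_threads < 2)
		nr_threads = 1;	/* do the load on the primary thread */
	return nr_threads;
}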
+ /* loop through index entries starting a thread for every thread_nr entries */
+ consumed = thread = 0;
+ for (i = 0; ; i++) {
+ struct ondisk_cache_entry *ondisk;
+ const char *name;
+ unsigned int flags;
+
+ /* we've reached the begining of a block of cache entries, kick off a thread to process them */
beginning
Thanks
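One more aside on why the primary thread has to do this walk itself: in a V4
index the pathnames are prefix-compressed against the previous entry, so block
boundaries and names can only be discovered sequentially. Roughly (the helper
name below is made up; strbuf_remove()/strbuf_addstr() are the existing strbuf
API):

/*
 * Illustrative sketch: a V4 on-disk name is "strip N bytes from the
 * end of the previous entry's name, then append this NUL-terminated
 * suffix", so each name depends on the one before it.
 */
static void expand_v4_name(struct strbuf *previous_name,
			   size_t strip, const char *suffix)
{
	strbuf_remove(previous_name, previous_name->len - strip, strip);
	strbuf_addstr(previous_name, suffix);
}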
Thanks,
Stefan