I was looking at the 'git gc' stalled-cycles profile and noticed that lookup_object() was the top entry: aldebaran:~/git> perf record -e stalled-cycles -F 10000 ./git gc Counting objects: 32459, done . Delta compression using up to 16 threads. Compressing objects: 100% (8161/8161), done. Writing objects: 100% (32459/32459), done. Total 32459 (delta 24077), reused 32459 (delta 24077) [ perf record: Woken up 5 times to write data ] [ perf record: Captured and wrote 1.199 MB perf.data (~52366 samples) ] aldebaran:~/git> perf report | head # Events: 30K stalled-cycles # # Overhead Command Shared Object Symbol # ........ .......... ..................... ......................................... # 36.36% git git [.] lookup_object 6.55% git git [.] find_pack_entry_one 6.53% git libz.so.1.2.5 [.] 0xc416 5.94% git libz.so.1.2.5 [.] inflate 4.12% git [kernel.kallsyms] [k] do_raw_spin_lock Annotated output showed the culprit: : if (!obj_hash) : return NULL; : : i = hashtable_index(sha1); : while ((obj = obj_hash[i]) != NULL) { 4.13 : 498316: eb 1f jmp 498337 <lookup_object+0x47> 0.00 : 498318: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) 0.00 : 49831f: 00 : if (!hashcmp(sha1, obj->sha1)) 1.48 : 498320: 48 8d 78 04 lea 0x4(%rax),%rdi 0.02 : 498324: 4c 89 d6 mov %r10,%rsi 0.00 : 498327: 4c 89 d9 mov %r11,%rcx 26.12 : 49832a: f3 a6 repz cmpsb %es:(%rdi),%ds:(%rsi) 17.12 : 49832c: 74 14 je 498342 <lookup_object+0x52> : break; : i++; 6.88 : 49832e: 83 c2 01 add $0x1,%edx : if (i == obj_hash_size) : i = 0; 2.28 : 498331: 44 39 ca cmp %r9d,%edx 0.24 : 498334: 0f 44 d3 cmove %ebx,%edx "perf stat --detailed" shows us the following picture: Performance counter stats for './git gc': 3145.596314 task-clock # 0.877 CPUs utilized 1,760 context-switches # 0.001 M/sec 174 CPU-migrations # 0.000 M/sec 41,509 page-faults # 0.013 M/sec 9,753,859,587 cycles # 3.101 GHz (22.91%) 2,555,944,921 stalled-cycles # 26.20% of all cycles are idle (33.89%) 8,976,468,086 instructions # 0.92 insns per cycle # 0.28 stalled cycles per insn (44.83%) 1,782,743,476 branches # 566.743 M/sec (55.70%) 85,045,367 branch-misses # 4.77% of all branches (66.54%) 1,982,452,996 L1-dcache-loads # 630.231 M/sec (66.18%) 152,320,833 L1-dcache-load-misses # 7.68% of all L1-dcache hits (55.50%) 43,358,073 LLC-loads # 13.784 M/sec (45.33%) 2,636,774 LLC-load-misses # 0.838 M/sec (11.50%) 3.586922714 seconds time elapsed ... so git gc is still fitting into the L1 cache mostly, and it rarely falls out of the L2 cache. So CPU execution is stalling processing longish hash chains and comparing sha1's. So i tried the quick patch below, which just increases the object hash size more aggressively, to 16x of the object count, not the previous 2x sizing. The results are (run against the Git repo itself, on ec014ea ("Git 1.7.5")): # # Before: # Performance counter stats for './git gc' (10 runs): 3147.437358 task-clock # 0.793 CPUs utilized ( +- 0.18% ) 1,753 context-switches # 0.001 M/sec ( +- 3.09% ) 165 CPU-migrations # 0.000 M/sec ( +- 2.86% ) 42,587 page-faults # 0.014 M/sec ( +- 0.04% ) 10,041,078,653 cycles # 3.190 GHz ( +- 0.18% ) 2,613,923,719 stalled-cycles # 26.03% of all cycles are idle ( +- 0.45% ) 9,110,524,009 instructions # 0.91 insns per cycle # 0.29 stalled cycles per insn ( +- 0.03% ) 1,796,732,369 branches # 570.856 M/sec ( +- 0.04% ) 84,828,313 branch-misses # 4.72% of all branches ( +- 0.06% ) 3.971525714 seconds time elapsed ( +- 8.56% ) # # After: # Performance counter stats for './git gc' (10 runs): 2805.034899 task-clock # 0.757 CPUs utilized ( +- 0.16% ) 1,709 context-switches # 0.001 M/sec ( +- 2.51% ) 169 CPU-migrations # 0.000 M/sec ( +- 1.73% ) 42,963 page-faults # 0.015 M/sec ( +- 0.23% ) 8,944,314,899 cycles # 3.189 GHz ( +- 0.17% ) 2,118,720,399 stalled-cycles # 23.69% of all cycles are idle ( +- 0.52% ) 9,017,027,059 instructions # 1.01 insns per cycle # 0.23 stalled cycles per insn ( +- 0.03% ) 1,780,388,097 branches # 634.712 M/sec ( +- 0.04% ) 76,104,907 branch-misses # 4.27% of all branches ( +- 0.07% ) 3.707549437 seconds time elapsed ( +- 7.13% ) The takeaway is that stalled cycles dropped by 23%: 2,613,923,719 stalled-cycles # 26.03% of all cycles are idle ( +- 0.45% ) 2,118,720,399 stalled-cycles # 23.69% of all cycles are idle ( +- 0.52% ) And total runtime (measured in cycles) decreased by 12.2%: 10,041,078,653 cycles # 3.190 GHz ( +- 0.18% ) 8,944,314,899 cycles # 3.189 GHz ( +- 0.17% ) Elapsed time dropped as well as expected, but measurement noise [last column] is high there, due to IO effects. Cache misses are down as well: # Before: 1,982,452,996 L1-dcache-loads # 630.231 M/sec (66.18%) 152,320,833 L1-dcache-load-misses # 7.68% of all L1-dcache hits (55.50%) 43,358,073 LLC-loads # 13.784 M/sec (45.33%) 2,636,774 LLC-load-misses # 0.838 M/sec (11.50%) # After: 1,946,597,133 L1-dcache-loads # 687.078 M/sec (67.61%) 125,634,149 L1-dcache-load-misses # 6.45% of all L1-dcache hits (56.38%) 30,778,323 LLC-loads # 10.864 M/sec (44.40%) 2,697,827 LLC-load-misses # 0.952 M/sec (10.94%) This is somewhat surprising, the hash table got larger after all. Cachemisses went down probably due to less chain-walking reducing the effective working set size of git gc. So oversizing the hash seems to works well for git gc. I tried a 4x and 8x oversizing as well, it was an improvement but 16x clearly beats it. Larger than 16x seems like overkill. I guess the hash function itself is as good as it gets: static unsigned int hashtable_index(const unsigned char *sha1) { unsigned int i; memcpy(&i, sha1, sizeof(unsigned int)); return i % obj_hash_size; } as sha1's ought to be fairly well distributed. The problem i suspect is that even with perfectly random distribution of the hash, there will always be second, third and higher order chains, which get more and more expensive to walk. So a 2x sized hash table becomes overcrowded. Thanks, Ingo Signed-off-by: Ingo Molnar <mingo@xxxxxxx> diff --git a/object.c b/object.c index 7e1f2bb..b3fe485 100644 --- a/object.c +++ b/object.c @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1) static void grow_object_hash(void) { int i; - int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size; + int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size; struct object **new_hash; new_hash = xcalloc(new_hash_size, sizeof(struct object *)); @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o) obj->flags = 0; hashcpy(obj->sha1, sha1); - if (obj_hash_size - 1 <= nr_objs * 2) + if (obj_hash_size - 1 <= nr_objs * 16) grow_object_hash(); insert_obj_hash(obj, obj_hash, obj_hash_size); -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html