Got no feedback for this patch - might have been drowned in the hashcmp discussion :) I tried various levels of object hash spreading factor and 16x seemed like a good one, but i can try a different patch as well if there's other ideas to fix this source of CPU utilization inefficiency. Thanks, Ingo * Ingo Molnar <mingo@xxxxxxx> wrote: > I was looking at the 'git gc' stalled-cycles profile and noticed that > lookup_object() was the top entry: > > aldebaran:~/git> perf record -e stalled-cycles -F 10000 ./git gc > Counting objects: 32459, done . > Delta compression using up to 16 threads. > Compressing objects: 100% (8161/8161), done. > Writing objects: 100% (32459/32459), done. > Total 32459 (delta 24077), reused 32459 (delta 24077) > [ perf record: Woken up 5 times to write data ] > [ perf record: Captured and wrote 1.199 MB perf.data (~52366 samples) ] > > aldebaran:~/git> perf report | head > # Events: 30K stalled-cycles > # > # Overhead Command Shared Object Symbol > # ........ .......... ..................... ......................................... > # > 36.36% git git [.] lookup_object > 6.55% git git [.] find_pack_entry_one > 6.53% git libz.so.1.2.5 [.] 0xc416 > 5.94% git libz.so.1.2.5 [.] inflate > 4.12% git [kernel.kallsyms] [k] do_raw_spin_lock > > Annotated output showed the culprit: > > : if (!obj_hash) > : return NULL; > : > : i = hashtable_index(sha1); > : while ((obj = obj_hash[i]) != NULL) { > 4.13 : 498316: eb 1f jmp 498337 <lookup_object+0x47> > 0.00 : 498318: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1) > 0.00 : 49831f: 00 > : if (!hashcmp(sha1, obj->sha1)) > 1.48 : 498320: 48 8d 78 04 lea 0x4(%rax),%rdi > 0.02 : 498324: 4c 89 d6 mov %r10,%rsi > 0.00 : 498327: 4c 89 d9 mov %r11,%rcx > 26.12 : 49832a: f3 a6 repz cmpsb %es:(%rdi),%ds:(%rsi) > 17.12 : 49832c: 74 14 je 498342 <lookup_object+0x52> > : break; > : i++; > 6.88 : 49832e: 83 c2 01 add $0x1,%edx > : if (i == obj_hash_size) > : i = 0; > 2.28 : 498331: 44 39 ca cmp %r9d,%edx > 0.24 : 498334: 0f 44 d3 cmove %ebx,%edx > > "perf stat --detailed" shows us the following picture: > > Performance counter stats for './git gc': > > 3145.596314 task-clock # 0.877 CPUs utilized > 1,760 context-switches # 0.001 M/sec > 174 CPU-migrations # 0.000 M/sec > 41,509 page-faults # 0.013 M/sec > 9,753,859,587 cycles # 3.101 GHz (22.91%) > 2,555,944,921 stalled-cycles # 26.20% of all cycles are idle (33.89%) > 8,976,468,086 instructions # 0.92 insns per cycle > # 0.28 stalled cycles per insn (44.83%) > 1,782,743,476 branches # 566.743 M/sec (55.70%) > 85,045,367 branch-misses # 4.77% of all branches (66.54%) > 1,982,452,996 L1-dcache-loads # 630.231 M/sec (66.18%) > 152,320,833 L1-dcache-load-misses # 7.68% of all L1-dcache hits (55.50%) > 43,358,073 LLC-loads # 13.784 M/sec (45.33%) > 2,636,774 LLC-load-misses # 0.838 M/sec (11.50%) > > 3.586922714 seconds time elapsed > > ... so git gc is still fitting into the L1 cache mostly, and it rarely falls > out of the L2 cache. So CPU execution is stalling processing longish hash > chains and comparing sha1's. > > So i tried the quick patch below, which just increases the object hash size > more aggressively, to 16x of the object count, not the previous 2x sizing. > > The results are (run against the Git repo itself, on ec014ea ("Git 1.7.5")): > > # > # Before: > # > > Performance counter stats for './git gc' (10 runs): > > 3147.437358 task-clock # 0.793 CPUs utilized ( +- 0.18% ) > 1,753 context-switches # 0.001 M/sec ( +- 3.09% ) > 165 CPU-migrations # 0.000 M/sec ( +- 2.86% ) > 42,587 page-faults # 0.014 M/sec ( +- 0.04% ) > 10,041,078,653 cycles # 3.190 GHz ( +- 0.18% ) > 2,613,923,719 stalled-cycles # 26.03% of all cycles are idle ( +- 0.45% ) > 9,110,524,009 instructions # 0.91 insns per cycle > # 0.29 stalled cycles per insn ( +- 0.03% ) > 1,796,732,369 branches # 570.856 M/sec ( +- 0.04% ) > 84,828,313 branch-misses # 4.72% of all branches ( +- 0.06% ) > > 3.971525714 seconds time elapsed ( +- 8.56% ) > > # > # After: > # > > Performance counter stats for './git gc' (10 runs): > > 2805.034899 task-clock # 0.757 CPUs utilized ( +- 0.16% ) > 1,709 context-switches # 0.001 M/sec ( +- 2.51% ) > 169 CPU-migrations # 0.000 M/sec ( +- 1.73% ) > 42,963 page-faults # 0.015 M/sec ( +- 0.23% ) > 8,944,314,899 cycles # 3.189 GHz ( +- 0.17% ) > 2,118,720,399 stalled-cycles # 23.69% of all cycles are idle ( +- 0.52% ) > 9,017,027,059 instructions # 1.01 insns per cycle > # 0.23 stalled cycles per insn ( +- 0.03% ) > 1,780,388,097 branches # 634.712 M/sec ( +- 0.04% ) > 76,104,907 branch-misses # 4.27% of all branches ( +- 0.07% ) > > 3.707549437 seconds time elapsed ( +- 7.13% ) > > The takeaway is that stalled cycles dropped by 23%: > > 2,613,923,719 stalled-cycles # 26.03% of all cycles are idle ( +- 0.45% ) > 2,118,720,399 stalled-cycles # 23.69% of all cycles are idle ( +- 0.52% ) > > And total runtime (measured in cycles) decreased by 12.2%: > > 10,041,078,653 cycles # 3.190 GHz ( +- 0.18% ) > 8,944,314,899 cycles # 3.189 GHz ( +- 0.17% ) > > Elapsed time dropped as well as expected, but measurement noise [last column] > is high there, due to IO effects. > > Cache misses are down as well: > > # Before: > > 1,982,452,996 L1-dcache-loads # 630.231 M/sec (66.18%) > 152,320,833 L1-dcache-load-misses # 7.68% of all L1-dcache hits (55.50%) > 43,358,073 LLC-loads # 13.784 M/sec (45.33%) > 2,636,774 LLC-load-misses # 0.838 M/sec (11.50%) > > # After: > > 1,946,597,133 L1-dcache-loads # 687.078 M/sec (67.61%) > 125,634,149 L1-dcache-load-misses # 6.45% of all L1-dcache hits (56.38%) > 30,778,323 LLC-loads # 10.864 M/sec (44.40%) > 2,697,827 LLC-load-misses # 0.952 M/sec (10.94%) > > This is somewhat surprising, the hash table got larger after all. Cachemisses > went down probably due to less chain-walking reducing the effective working set > size of git gc. > > So oversizing the hash seems to works well for git gc. I tried a 4x and 8x > oversizing as well, it was an improvement but 16x clearly beats it. Larger > than 16x seems like overkill. > > I guess the hash function itself is as good as it gets: > > static unsigned int hashtable_index(const unsigned char *sha1) > { > unsigned int i; > memcpy(&i, sha1, sizeof(unsigned int)); > return i % obj_hash_size; > } > > as sha1's ought to be fairly well distributed. The problem i suspect is that > even with perfectly random distribution of the hash, there will always be > second, third and higher order chains, which get more and more expensive to > walk. So a 2x sized hash table becomes overcrowded. > > Thanks, > > Ingo > > Signed-off-by: Ingo Molnar <mingo@xxxxxxx> > > diff --git a/object.c b/object.c > index 7e1f2bb..b3fe485 100644 > --- a/object.c > +++ b/object.c > @@ -91,7 +91,7 @@ struct object *lookup_object(const unsigned char *sha1) > static void grow_object_hash(void) > { > int i; > - int new_hash_size = obj_hash_size < 32 ? 32 : 2 * obj_hash_size; > + int new_hash_size = obj_hash_size < 32 ? 32 : 16 * obj_hash_size; > struct object **new_hash; > > new_hash = xcalloc(new_hash_size, sizeof(struct object *)); > @@ -116,7 +116,7 @@ void *create_object(const unsigned char *sha1, int type, void *o) > obj->flags = 0; > hashcpy(obj->sha1, sha1); > > - if (obj_hash_size - 1 <= nr_objs * 2) > + if (obj_hash_size - 1 <= nr_objs * 16) > grow_object_hash(); > > insert_obj_hash(obj, obj_hash, obj_hash_size); -- Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html