On Tue, Jun 07, 2016 at 05:36:59PM -0700, Junio C Hamano wrote: > Jeff King <peff@xxxxxxxx> writes: > > > An alternative to this would be implement something like: > > > > struct tree *tp, tp_fallback[2]; > > if (nparent <= ARRAY_SIZE(tp_fallback)) > > tp = tp_fallback; > > else > > ALLOC_ARRAY(tp, nparent); > > ... > > if (tp != tp_fallback) > > free(tp); > > > > That would let us drop our xalloca() portability code > > entirely. But in my measurements, this seemed to perform > > slightly worse than the xalloca solution. > > It indeed is curious why this "obvious" alternative performed > worse. Yeah. I'd be happy if somebody wanted to prove me wrong here. It's entirely possible I just did something stupid. I had wrapped up the above in the FAST_ARRAY_ALLOC macro. It would waste the stack space when we _did_ have to call malloc, but that should almost never happen in my benchmark. It also allocates an extra slot on the stack for non-merge commits. So I guess the wasted 24 bytes or whatever could have in impact. It's also possible that the extra variable simply tickles the optimizer in some way; I didn't look at the generated asm. FWIW, here are the timings I had come up with for running "git log --raw --no-abbrev --no-renames" on linux.git (the same benchmark from the commit that originally added the alloca). The "time" output at the end is best-of-five, with the "attempt" lines showing the wall-clock time for each: [original, with xalloca] Attempt 1: 30.669 Attempt 2: 30.782 Attempt 3: 31.807 Attempt 4: 31.152 Attempt 5: 30.243 real 0m30.243s user 0m30.112s sys 0m0.128s [xmalloc/free instead of xalloca] Attempt 1: 31.306 Attempt 2: 31.814 Attempt 3: 31.138 Attempt 4: 31.787 Attempt 5: 32.23 real 0m31.138s user 0m30.976s sys 0m0.160s [local var with fallback to xmalloc] Attempt 1: 30.926 Attempt 2: 31.466 Attempt 3: 31.865 Attempt 4: 31.345 Attempt 5: 33.159 real 0m30.926s user 0m30.804s sys 0m0.120s So the improvement here over even a naive malloc/free is _really_ small. You'll note that the best-of-five for xmalloc is actually smaller than the worst-case with xalloca. But the mean is still 2% better. The mean for the "fallback" case aren't really any better (which makes me wonder if I was somehow accidentally calling malloc each time). So I dunno. For 2%, I was tempted to simply say "screw it, let's just forget this micro-optimization and call xmalloc". This patch is the conservative choice in that it has the same performance profile as the original. Or so I thought. I didn't record the old timings, but they looked similar to the original. I just recreated them while writing this email and got: [xalloca with fallback to xmalloc] Attempt 1: 31.356 Attempt 2: 31.725 Attempt 3: 31.454 Attempt 4: 30.898 Attempt 5: 31.937 real 0m30.898s user 0m30.752s sys 0m0.144s which really isn't much better than the local-var case. I wonder if just having the conditional stack/heap is what kills us (either itself, or because it affects the optimizer). I dunno. > > +#define FAST_ARRAY_ALLOC(x, nr) do { \ > > + if ((nr) <= 2) \ > > + (x) = xalloca((nr) * sizeof(*(x))); \ > > + else \ > > + ALLOC_ARRAY((x), nr); \ > > +} while(0) > > +#define FAST_ARRAY_FREE(x, nr) do { \ > > + if ((nr) > 2) \ > > + free((x)); \ > > +} while(0) > > An obvious and clean implementation of the idea. > > The only slight worry I have is that nr must stay constant until the > time the caller calls FREE(), but for the only three callsite pairs > it is clear nparent will stay constant and this is local to the > file. Yep, I had the same worry. I think it's OK because the damage is limited to tree-diff.c. I'm not sure I would promote this macro for general use. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html