On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:

> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@xxxxxxxx> wrote:
> >
> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
> >
> > > Here are my p5302 numbers on linux.git, by the way.
> > >
> > > Test                                           jk/p5302-repeat-fix
> > > ------------------------------------------------------------------
> > > 5302.2: index-pack 0 threads                   307.04(303.74+3.30)
> > > 5302.3: index-pack 1 thread                    309.74(306.13+3.56)
> > > 5302.4: index-pack 2 threads                   177.89(313.73+3.60)
> > > 5302.5: index-pack 4 threads                   117.14(344.07+4.29)
> > > 5302.6: index-pack 8 threads                   112.40(607.12+5.80)
> > > 5302.7: index-pack default number of threads   135.00(322.03+3.74)
> > >
> > > which still imply that "4" is a win over "3" ("8" is slightly better
> > > still in wall-clock time, but the total CPU rises dramatically; that's
> > > probably because this is a quad-core with hyperthreading, so by that
> > > point we're just throttling down the CPUs).
> >
> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
> > tweaked the test to keep going after eight threads):
> >
> > Test                            HEAD
> > ----------------------------------------------------
> > 5302.2: index-pack 1 threads    376.88(364.50+11.52)
> > 5302.3: index-pack 2 threads    228.13(371.21+17.86)
> > 5302.4: index-pack 4 threads    151.41(387.06+21.12)
> > 5302.5: index-pack 8 threads    113.68(413.40+25.80)
> > 5302.6: index-pack 16 threads   100.60(511.85+37.53)
> > 5302.7: index-pack 32 threads   94.43(623.82+45.70)
> > 5302.8: index-pack 40 threads   93.64(702.88+47.61)
> >
> > I don't think any of this is _particularly_ relevant to your case, but
> > it really seems to me that the default of capping at 3 threads is too
> > low.
>
> Looking back at the multithread commit, I think the trend was the same
> and I capped it because the gain was not proportional to the number of
> cores we threw at index-pack anymore. I would not be opposed to
> raising the cap though (or maybe just remove it)

I'm not sure what the right cap would be. I don't think it's static;
we'd want ~4 threads on the top case, and 10-20 on the bottom one.

It does seem like there's an inflection point in the graph at N/2
threads. But then maybe that's just because these are hyper-threaded
machines, so "N/2" is the actual number of physical cores, and the
inflated CPU times above that are just because we can't turbo-boost
then, so we're actually clocking slower. Multi-threaded profiling and
measurement is such a mess. :)

So I'd say the right answer is probably either online_cpus() or half
that. The latter would be more appropriate for the machines I have, but
I'd worry that it would leave performance on the table for non-intel
machines.

-Peff
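
To put rough numbers on the scaling discussed above, here is a minimal
Python sketch. It takes the wall-clock times from the 20-core Xeon run
and computes speedup and per-thread efficiency relative to one thread,
then shows the two candidate defaults (all logical CPUs, or half of
them). The names XEON_WALL, scaling() and default_threads() are
hypothetical and exist only for this illustration; this is not how
index-pack picks its thread count, and the online_cpus argument simply
stands in for whatever value git's online_cpus() would report.

```python
# Wall-clock seconds from the 20-core Xeon run quoted above (threads -> elapsed).
XEON_WALL = {
    1: 376.88,
    2: 228.13,
    4: 151.41,
    8: 113.68,
    16: 100.60,
    32: 94.43,
    40: 93.64,
}


def scaling(wall):
    """Return (threads, speedup, efficiency) tuples relative to the 1-thread run."""
    base = wall[1]
    rows = []
    for n in sorted(wall):
        speedup = base / wall[n]
        # Efficiency is the fraction of ideal linear scaling actually achieved.
        rows.append((n, speedup, speedup / n))
    return rows


def default_threads(online_cpus, halve=False):
    """Hypothetical policy: all logical CPUs, or half of them, never below 1."""
    if halve:
        return max(1, online_cpus // 2)
    return max(1, online_cpus)


if __name__ == "__main__":
    for n, speedup, eff in scaling(XEON_WALL):
        print(f"{n:>3} threads: speedup {speedup:5.2f}x, efficiency {eff:6.1%}")
    print("all CPUs: ", default_threads(40))
    print("half CPUs:", default_threads(40, halve=True))
```

Efficiency falls at every step in this data set, which is the "gain was
not proportional" effect mentioned in the quoted text. The halve option
corresponds to the "half that" choice and assumes two hardware threads
per physical core, which will not hold on every machine.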