On Tue, Apr 30 2019, Jeff King wrote:

> On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:
>
>> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@xxxxxxxx> wrote:
>> >
>> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
>> >
>> > > Here are my p5302 numbers on linux.git, by the way.
>> > >
>> > >   Test                                          jk/p5302-repeat-fix
>> > >   ------------------------------------------------------------------
>> > >   5302.2: index-pack 0 threads                  307.04(303.74+3.30)
>> > >   5302.3: index-pack 1 thread                   309.74(306.13+3.56)
>> > >   5302.4: index-pack 2 threads                  177.89(313.73+3.60)
>> > >   5302.5: index-pack 4 threads                  117.14(344.07+4.29)
>> > >   5302.6: index-pack 8 threads                  112.40(607.12+5.80)
>> > >   5302.7: index-pack default number of threads  135.00(322.03+3.74)
>> > >
>> > > which still imply that "4" is a win over "3" ("8" is slightly better
>> > > still in wall-clock time, but the total CPU rises dramatically; that's
>> > > probably because this is a quad-core with hyperthreading, so by that
>> > > point we're just throttling down the CPUs).
>> >
>> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
>> > tweaked the test to keep going after eight threads):
>> >
>> >   Test                             HEAD
>> >   ----------------------------------------------------
>> >   5302.2: index-pack 1 threads     376.88(364.50+11.52)
>> >   5302.3: index-pack 2 threads     228.13(371.21+17.86)
>> >   5302.4: index-pack 4 threads     151.41(387.06+21.12)
>> >   5302.5: index-pack 8 threads     113.68(413.40+25.80)
>> >   5302.6: index-pack 16 threads    100.60(511.85+37.53)
>> >   5302.7: index-pack 32 threads    94.43(623.82+45.70)
>> >   5302.8: index-pack 40 threads    93.64(702.88+47.61)
>> >
>> > I don't think any of this is _particularly_ relevant to your case, but
>> > it really seems to me that the default of capping at 3 threads is too
>> > low.
>>
>> Looking back at the multithread commit, I think the trend was the same
>> and I capped it because the gain was not proportional to the number of
>> cores we threw at index-pack anymore. I would not be opposed to
>> raising the cap though (or maybe just remove it)
>
> I'm not sure what the right cap would be. I don't think it's static;
> we'd want ~4 threads on the top case, and 10-20 on the bottom one.
>
> It does seem like there's an inflection point in the graph at N/2
> threads. But then maybe that's just because these are hyper-threaded
> machines, so "N/2" is the actual number of physical cores, and the
> inflated CPU times above that are just because we can't turbo-boost
> then, so we're actually clocking slower. Multi-threaded profiling and
> measurement is such a mess. :)
>
> So I'd say the right answer is probably either online_cpus() or half
> that. The latter would be more appropriate for the machines I have, but
> I'd worry that it would leave performance on the table for non-intel
> machines.

It would be a nice #leftoverbits project to do this dynamically at
runtime, i.e. hook up the throughput code in progress.c to some new
utility functions, so that the current pthreads code would occasionally
stop and try to find some (local) maximum throughput for N threads.

You could then save that optimum for next time, or keep adjusting the
threading at runtime every X seconds. E.g. on a server with N=24 cores
you might want 24 threads if there's one index-pack running, but if
there are 24 index-packs you probably don't want each of them spawning
24 threads, for a total of 576.
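
To sketch what I mean (this is a standalone toy, not the actual
progress.c or index-pack code; the dummy payload, the 1-second sampling
interval and all the names are made up): workers bump a shared byte
counter, and a tuner loop keeps adding a thread while throughput
improves and backs off one step once it stops, i.e. it settles on a
local maximum:

	/*
	 * Rough standalone sketch, not git code: workers increment a
	 * shared counter; the main loop hill-climbs the number of
	 * active workers toward whatever count gives the best
	 * throughput over a 1-second sample.
	 */
	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	#define MAX_THREADS 64

	static atomic_uint_least64_t bytes_done;
	static atomic_int active_limit;	/* workers with idx >= this park */
	static atomic_int stop_all;

	struct worker {
		int idx;
		pthread_t tid;
	};

	static void do_work(void)
	{
		/* stand-in for "resolve one delta" or similar */
		static const unsigned char buf[4096] = { 1 };
		volatile unsigned sum = 0;
		size_t i;

		for (i = 0; i < sizeof(buf); i++)
			sum += buf[i];
		atomic_fetch_add(&bytes_done, sizeof(buf));
	}

	static void *worker_fn(void *arg)
	{
		struct worker *w = arg;

		while (!atomic_load(&stop_all)) {
			if (w->idx >= atomic_load(&active_limit)) {
				usleep(10000);	/* parked: above the limit */
				continue;
			}
			do_work();
		}
		return NULL;
	}

	int main(void)
	{
		struct worker workers[MAX_THREADS];
		long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
		uint64_t best = 0;
		int i, tick;

		if (ncpu < 1)
			ncpu = 1;
		if (ncpu > MAX_THREADS)
			ncpu = MAX_THREADS;

		atomic_store(&active_limit, 1);
		for (i = 0; i < ncpu; i++) {
			workers[i].idx = i;
			pthread_create(&workers[i].tid, NULL, worker_fn,
				       &workers[i]);
		}

		/*
		 * Tuner: once per second, compare throughput against the
		 * best seen so far; add a worker while it keeps improving,
		 * back off one step once it stops.
		 */
		for (tick = 0; tick < 2 * ncpu; tick++) {
			int n = atomic_load(&active_limit);
			uint64_t rate;

			atomic_store(&bytes_done, 0);
			sleep(1);
			rate = atomic_load(&bytes_done);
			printf("%d thread(s): %llu bytes/s\n", n,
			       (unsigned long long)rate);

			if (rate > best && n < ncpu) {
				best = rate;
				atomic_store(&active_limit, n + 1);
			} else {
				if (rate <= best && n > 1)
					atomic_store(&active_limit, n - 1);
				break;
			}
		}
		printf("settled on %d thread(s)\n", atomic_load(&active_limit));

		atomic_store(&stop_all, 1);
		for (i = 0; i < ncpu; i++)
			pthread_join(workers[i].tid, NULL);
		return 0;
	}

That would build with "cc -pthread". The interesting part for git would
be persisting the settled-on number (or re-probing every so often), so
that concurrent index-packs on the same box don't all grab
online_cpus() threads at once.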