Re: Resolving deltas dominates clone time

On Tue, Apr 30 2019, Jeff King wrote:

> On Tue, Apr 23, 2019 at 05:08:40PM +0700, Duy Nguyen wrote:
>
>> On Tue, Apr 23, 2019 at 11:45 AM Jeff King <peff@xxxxxxxx> wrote:
>> >
>> > On Mon, Apr 22, 2019 at 09:55:38PM -0400, Jeff King wrote:
>> >
>> > > Here are my p5302 numbers on linux.git, by the way.
>> > >
>> > >   Test                                           jk/p5302-repeat-fix
>> > >   ------------------------------------------------------------------
>> > >   5302.2: index-pack 0 threads                   307.04(303.74+3.30)
>> > >   5302.3: index-pack 1 thread                    309.74(306.13+3.56)
>> > >   5302.4: index-pack 2 threads                   177.89(313.73+3.60)
>> > >   5302.5: index-pack 4 threads                   117.14(344.07+4.29)
>> > >   5302.6: index-pack 8 threads                   112.40(607.12+5.80)
>> > >   5302.7: index-pack default number of threads   135.00(322.03+3.74)
>> > >
>> > > which still imply that "4" is a win over "3" ("8" is slightly better
>> > > still in wall-clock time, but the total CPU rises dramatically; that's
>> > > probably because this is a quad-core with hyperthreading, so by that
>> > > point we're just throttling down the CPUs).
>> >
>> > And here's a similar test run on a 20-core Xeon w/ hyperthreading (I
>> > tweaked the test to keep going after eight threads):
>> >
>> > Test                            HEAD
>> > ----------------------------------------------------
>> > 5302.2: index-pack 1 threads    376.88(364.50+11.52)
>> > 5302.3: index-pack 2 threads    228.13(371.21+17.86)
>> > 5302.4: index-pack 4 threads    151.41(387.06+21.12)
>> > 5302.5: index-pack 8 threads    113.68(413.40+25.80)
>> > 5302.6: index-pack 16 threads   100.60(511.85+37.53)
>> > 5302.7: index-pack 32 threads   94.43(623.82+45.70)
>> > 5302.8: index-pack 40 threads   93.64(702.88+47.61)
>> >
>> > I don't think any of this is _particularly_ relevant to your case, but
>> > it really seems to me that the default of capping at 3 threads is too
>> > low.
>>
>> Looking back at the multithread commit, I think the trend was the same
>> and I capped it because the gain was not proportional to the number of
>> cores we threw at index-pack anymore. I would not be opposed to
>> raising the cap though (or maybe just remove it)
>
> I'm not sure what the right cap would be. I don't think it's static;
> we'd want ~4 threads on the top case, and 10-20 on the bottom one.
>
> It does seem like there's an inflection point in the graph at N/2
> threads. But then maybe that's just because these are hyper-threaded
> machines, so "N/2" is the actual number of physical cores, and the
> inflated CPU times above that are just because we can't turbo-boost
> then, so we're actually clocking slower. Multi-threaded profiling and
> measurement is such a mess. :)
>
> So I'd say the right answer is probably either online_cpus() or half
> that. The latter would be more appropriate for the machines I have, but
> I'd worry that it would leave performance on the table for non-intel
> machines.

It would be a nice #leftoverbits project to figure this out dynamically
at runtime, i.e. hook the throughput code in progress.c up to some new
utility functions, so that the current pthreads code could occasionally
stop and probe for a (local) maximum throughput as a function of the
number of threads.
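
Something like the following, as a very rough hill-climbing sketch
(none of these names exist in git, and the thresholds are made up):

static int next_thread_count(int nr_threads, int max_threads,
			     double rate, double *last_rate)
{
	/*
	 * "rate" is objects/sec measured over the last sampling
	 * interval; step the worker count toward a local maximum.
	 */
	if (rate > *last_rate * 1.05 && nr_threads < max_threads)
		nr_threads++;	/* throughput still rising: add a worker */
	else if (rate < *last_rate * 0.95 && nr_threads > 1)
		nr_threads--;	/* past the knee: back off */
	*last_rate = rate;
	return nr_threads;
}

The idea being that whatever periodically updates the throughput
display could also call this every few seconds and start or park
resolver threads accordingly.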

You could then save that optimum for next time, or keep adjusting the
thread count every X seconds at runtime. E.g. on a server with N=24
cores you might want 24 threads when a single index-pack is running,
but with 24 concurrent index-packs you probably don't want each of them
spinning up 24 threads, for a total of 576.
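
For comparison, the static option discussed upthread is much simpler:
just drop or loosen the existing clamp in builtin/index-pack.c (which
today caps the default at 3 threads), along the lines of this sketch:

	if (!nr_threads) {
		nr_threads = online_cpus();
		/*
		 * Perhaps "nr_threads /= 2" here to discount
		 * hyper-threaded siblings, per the numbers above.
		 */
	}

But a static default like that still can't know about the other 23
index-packs on the box, which is where runtime tuning would help.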


