On Thu, May 28, 2020 at 06:11:12PM +0200, Peter Zijlstra wrote:
> > FWIW, I think you're referring to Mel's notice in OSPM regarding the overhead.
> > Trying to see what goes on in there.
>
> Indeed, that one. The fact that regular distros cannot enable this
> feature due to performance overhead is unfortunate. It means there is a
> lot less potential for this stuff.

During that talk, I was vague about the cost, admitted I had not looked
too closely at mainline performance and had since deleted the data, given
that the problem was first spotted in early April. If I heard someone
else making statements like I did at the talk, I would consider it a bit
vague, potentially FUD, possibly wrong and worth rechecking myself. In
terms of distributions "cannot enable this": we could, but I was unwilling
to pay the cost for a feature no one has asked for yet. If they had, I
would endeavour to put it behind static branches and disable it by default
(like what happened for PSI). A sketch of the sort of gating I mean
follows the test details below.

I was contacted offlist about my comments at OSPM and gathered new data
to respond properly. For the record, here is an edited version of my
response:

--8<--

(Some context deleted that is not relevant)

> Does it need any special admin configuration for system
> services, cgroups, scripts, etc?

Nothing special -- out of box configuration. Tests were executed via
mmtests.

> Which mmtests config file did you use?

I used network-netperf-unbound and network-netperf-cstate.
network-netperf-unbound is usually the default but, for some issues, I
use the cstate configuration to limit C-states. For a perf profile, I
used network-netperf-cstate-small and network-netperf-unbound-small to
limit the amount of profile data that was collected. Just collecting
data for 64 byte buffers was enough.

> The server that I am going to configure is x86_64 numa, not arm64.

That's fine, I didn't actually test arm64 at all.

> I have a 2 socket 24 CPUs X86 server (4 NUMA nodes, AMD Opteron 6174,
> L2 512KB/cpu, L3 6MB/node, RAM 40GB/node).
> Which machine did you run it on?

It was a 2-socket Haswell machine (E5-2670 v3) with 2 NUMA nodes. I used
5.7-rc7 with the openSUSE Leap 15.1 kernel configuration as a baseline
and compared with and without uclamp enabled.
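Before the numbers, to be concrete about the static-branch idea mentioned
above: the gating I have in mind would look roughly like the sketch below.
This is not an actual patch; the key name and the boot parameter are
invented for illustration and the real code may well want something
different.

/*
 * Illustrative sketch only. Default off so the fast-path check patches
 * to a NOP, with an explicit opt-in at boot, broadly similar to how PSI
 * can be disabled by default and enabled on request.
 */
#include <linux/jump_label.h>
#include <linux/init.h>

DEFINE_STATIC_KEY_FALSE(sched_uclamp_enabled);

static int __init setup_sched_uclamp(char *str)
{
	static_branch_enable(&sched_uclamp_enabled);
	return 1;
}
__setup("sched_uclamp_enable", setup_sched_uclamp);

/* The enqueue/dequeue accounting would then bail out early: */
static inline void uclamp_rq_inc(struct rq *rq, struct task_struct *p)
{
	if (!static_branch_unlikely(&sched_uclamp_enabled))
		return;

	/* existing per-bucket accounting only runs when enabled */
}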
For network-netperf-unbound I see

netperf-udp
                                 5.7.0-rc7              5.7.0-rc7
                                with-clamp          without-clamp
Hmean     send-64         238.52 (   0.00%)      257.28 *   7.87%*
Hmean     send-128        477.10 (   0.00%)      511.57 *   7.23%*
Hmean     send-256        945.53 (   0.00%)      982.50 *   3.91%*
Hmean     send-1024      3655.74 (   0.00%)     3846.98 *   5.23%*
Hmean     send-2048      6926.84 (   0.00%)     7247.04 *   4.62%*
Hmean     send-3312     10767.47 (   0.00%)    10976.73 (   1.94%)
Hmean     send-4096     12821.77 (   0.00%)    13506.03 *   5.34%*
Hmean     send-8192     22037.72 (   0.00%)    22275.29 (   1.08%)
Hmean     send-16384    35935.31 (   0.00%)    34737.63 *  -3.33%*
Hmean     recv-64         238.52 (   0.00%)      257.28 *   7.87%*
Hmean     recv-128        477.10 (   0.00%)      511.57 *   7.23%*
Hmean     recv-256        945.45 (   0.00%)      982.50 *   3.92%*
Hmean     recv-1024      3655.74 (   0.00%)     3846.98 *   5.23%*
Hmean     recv-2048      6926.84 (   0.00%)     7246.51 *   4.62%*
Hmean     recv-3312     10767.47 (   0.00%)    10975.93 (   1.94%)
Hmean     recv-4096     12821.76 (   0.00%)    13506.02 *   5.34%*
Hmean     recv-8192     22037.71 (   0.00%)    22274.55 (   1.07%)
Hmean     recv-16384    35934.82 (   0.00%)    34737.50 *  -3.33%*

netperf-tcp
                                 5.7.0-rc7              5.7.0-rc7
                                with-clamp          without-clamp
Min       64             2004.71 (   0.00%)     2033.23 (   1.42%)
Min       128            3657.58 (   0.00%)     3733.35 (   2.07%)
Min       256            6063.25 (   0.00%)     6105.67 (   0.70%)
Min       1024          18152.50 (   0.00%)    18487.00 (   1.84%)
Min       2048          28544.54 (   0.00%)    29218.11 (   2.36%)
Min       3312          33962.06 (   0.00%)    36094.97 (   6.28%)
Min       4096          36234.82 (   0.00%)    38223.60 (   5.49%)
Min       8192          42324.06 (   0.00%)    43328.72 (   2.37%)
Min       16384         44323.33 (   0.00%)    45315.21 (   2.24%)
Hmean     64             2018.36 (   0.00%)     2038.53 *   1.00%*
Hmean     128            3700.12 (   0.00%)     3758.20 *   1.57%*
Hmean     256            6236.14 (   0.00%)     6212.77 (  -0.37%)
Hmean     1024          18214.97 (   0.00%)    18601.01 *   2.12%*
Hmean     2048          28749.56 (   0.00%)    29728.26 *   3.40%*
Hmean     3312          34585.50 (   0.00%)    36345.09 *   5.09%*
Hmean     4096          36777.62 (   0.00%)    38576.17 *   4.89%*
Hmean     8192          43149.08 (   0.00%)    43903.77 *   1.75%*
Hmean     16384         45478.27 (   0.00%)    46372.93 (   1.97%)

The cstate-limited config had similar results for UDP_STREAM but was
mostly indifferent for TCP_STREAM. So for UDP_STREAM, there is a fairly
sizable difference with uclamp. There are caveats: netperf is not 100%
stable from a performance perspective on NUMA machines. That's improved
quite a bit with 5.7 but it should still be treated with care. When I
first saw a problem, I was using ftrace looking for latencies and uclamp
appeared to crop up. As I didn't actually need uclamp and there was no
user request to support it, I simply dropped it from the master config
so it would get propagated to any distro we release with a 5.x kernel.

From a perf profile, it's not particularly obvious that uclamp is
involved, so it could be in error, but I doubt it. A diff of without vs
with looks like

# Event 'cycles:ppp'
#
# Baseline  Delta Abs  Shared Object       Symbol
# ........  .........  ..................  ..............................
#
     9.59%     -2.87%  [kernel.vmlinux]    [k] poll_idle
     0.19%     +1.85%  [kernel.vmlinux]    [k] activate_task
               +1.17%  [kernel.vmlinux]    [k] dequeue_task
               +0.89%  [kernel.vmlinux]    [k] update_rq_clock.part.73
     3.88%     +0.73%  [kernel.vmlinux]    [k] try_to_wake_up
     3.17%     +0.68%  [kernel.vmlinux]    [k] __schedule
     1.16%     -0.60%  [kernel.vmlinux]    [k] __update_load_avg_cfs_rq
     2.20%     -0.54%  [kernel.vmlinux]    [k] resched_curr
     2.08%     -0.29%  [kernel.vmlinux]    [k] _raw_spin_lock_irqsave
     0.44%     -0.29%  [kernel.vmlinux]    [k] cpus_share_cache
     1.13%     +0.23%  [kernel.vmlinux]    [k] _raw_spin_lock_bh

A lot of the uclamp functions appear to be inlined, so it is not
particularly obvious from a raw profile, but it shows up in the annotated
profile in activate_task and dequeue_task for example.
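For reference, the reason the cost gets attributed to the callers is the
shape of the call chain. Very roughly, and simplified from my reading of
the 5.7 code rather than quoted verbatim, every dequeue walks the clamp
indexes and updates per-rq bucket state through inlined helpers:

/* Simplified sketch, not the verbatim 5.7 source. */
static inline void uclamp_rq_dec(struct rq *rq, struct task_struct *p)
{
	enum uclamp_id clamp_id;

	/* One bucket update per clamp index on every dequeue. */
	for_each_clamp_id(clamp_id)
		uclamp_rq_dec_id(rq, p, clamp_id);
}

static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
{
	if (!(flags & DEQUEUE_NOCLOCK))
		update_rq_clock(rq);

	/* Inlined accounting, so the cycles land here in the profile. */
	uclamp_rq_dec(rq, p);
	p->sched_class->dequeue_task(rq, p, flags);
}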
In the case of dequeue_task, uclamp_rq_dec_id() is extremely expensive
according to the annotated profile. I'm afraid I did not dig into this
deeply once I knew I could just disable it even within the distribution.

-- 
Mel Gorman
SUSE Labs