On Wed, Mar 01, 2023 at 05:32:14PM +0100, Antonio Paolillo wrote:
> Dear all,
> 
> I want to provide some data to support Hernan regarding the performance
> claims.
> 
> I used lock_torture to evaluate the different proposed patches on two
> different server machines:
> - a Huawei TaiShan 200 (Model 2280) rack server that has 128 GB of RAM
>   and 2x Kunpeng 920-4826 processors, a HiSilicon chip with 48 ARMv8.2
>   64-bit cores, totaling 96 cores (no SMT) [1, 2],
>   denoted as taishan200-96c;
> - a GIGABYTE R182-Z91-00 rack server that has 128 GB of RAM and 2x
>   EPYC 7352 processors, an AMD chip with 24 x86_64 cores, totaling 48
>   cores (96 CPUs when counting hyperthreading) [3, 4],
>   denoted as gigabyte-96c.
> 
> I ran the evaluation on an Ubuntu 22.04 distro, with custom kernels
> based on v6.2-rc6 (6d796c50f84ca79f1722bb131799e5a5710c4700).
> The different kernels are combinations of patches:
> - (0) Stock kernel;
> - (1) With relaxed set owner barrier (as discussed in [5] and
>       questioned by Peter; the barrier seems not to be needed);
> - (2) With READ_ONCE(), as originally proposed in this thread;
> - (3) With atomic_long_or() as proposed by Peter;
> - (4) With relaxed set owner barrier and READ_ONCE();
> - (5) With relaxed set owner barrier and atomic_long_or().
> 
> I ran lock_torture several times, exploring the following parameter
> space:
> - torture_type="rtmutex_lock",
> - nwriters_stress=[1, 2, 3, 4, 8, 16, 32, 64, 95],
> - stat_interval=4,
> - stutter=0,
> - shuffle_interval=0.
> For each value of "nwriters_stress", I ran the configuration 5 times.
> 
> By feeding the lock_torture kthread pids to "taskset -p", I overrode
> the scheduler so that the assignment of kthreads to CPUs was fixed.
> I also disabled the irqbalance daemon and NUMA balancing, fixed the
> frequency to 1.5 GHz using the "userspace" cpufreq governor, and
> isolated all the cores used (using isolcpus=1-95 at boot time) to avoid
> any source of interference.
> 
> As a warm-up phase, I ignored the first reported results and only
> considered the last 60 seconds of execution (after all kthreads had
> migrated to their final CPU).
> The reported throughput is computed by dividing the reported number of
> operations by the duration of the measurement window for each data
> point (60 seconds), so higher is better.
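> 
> In other words, for each (kernel, nwriters_stress) configuration the
> reported numbers boil down to something like the following (an
> illustrative Python sketch of the post-processing, not the actual
> scripts; function and variable names are made up):
> 
>   from statistics import mean, stdev
> 
>   MEASUREMENT_SECONDS = 60  # last 60 s of each run, after the warm-up phase
> 
>   def throughput_kops(ops_completed):
>       # Operations reported by lock_torture over the measurement window,
>       # converted to kops/s.
>       return ops_completed / MEASUREMENT_SECONDS / 1e3
> 
>   def summarize(runs_ops, stock_mean_kops=None):
>       # runs_ops: operation counts from the 5 independent runs of one
>       # (kernel, nwriters_stress) configuration.  Returns (mean, std, rel),
>       # where rel is the relative difference of the mean with respect to
>       # the stock kernel's mean, in percent.
>       kops = [throughput_kops(ops) for ops in runs_ops]
>       m, s = mean(kops), stdev(kops)
>       rel = None
>       if stock_mean_kops is not None:
>           rel = 100.0 * (m - stock_mean_kops) / stock_mean_kops
>       return m, s, rel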
> 
> Here follow the results on taishan200-96c (the 'rel' column is the
> relative difference of the mean with respect to the stock kernel's
> mean, in percent, and each mean is the average over 5 independent
> runs):
> 
> Kernel:              k0-stock-6.2.0-rc6 | k1-rmacq | k2-readonce | k3-alongor | k4-rmacq+readonce | k5-rmacq+alongor
> Statistic (kops/s):  mean std | mean std rel | mean std rel | mean std rel | mean std rel | mean std rel
> nwriters_stress:
>  1   899.91 24.95 | 880.10 29.62  -2% | 871.57 44.27  -3% | 888.65 37.90  -1% | 898.63 29.82  -0% | 889.83 25.64  -1%
>  2   359.30 25.92 | 416.83 32.77 +16% | 360.65 28.32  +0% | 404.79 42.64 +13% | 380.65 21.29  +6% | 404.37 23.27 +13%
>  3   314.97 24.32 | 308.41  9.68  -2% | 315.00  9.97  +0% | 313.86 13.47  -0% | 313.47  4.01  -0% | 322.77 20.82  +2%
>  4   328.02 15.09 | 330.65 29.33  +1% | 314.83 24.28  -4% | 305.71 12.72  -7% | 322.95 10.39  -2% | 343.32 13.73  +5%
>  8   292.16 22.03 | 288.85 10.50  -1% | 288.28 18.84  -1% | 285.42 24.58  -2% | 310.23 26.08  +6% | 285.67 20.03  -2%
> 16   297.03 26.89 | 281.89 29.22  -5% | 265.19 33.73 -11% | 279.02 22.43  -6% | 284.40 36.21  -4% | 285.21 36.33  -4%
> 32   187.36 28.59 | 175.71 19.77  -6% | 186.44 48.15  -0% | 206.59 14.11 +10% | 174.08 24.30  -7% | 185.80 45.12  -1%
> 64   148.13 48.65 | 172.48 34.29 +16% | 154.59 47.05  +4% | 164.22 29.81 +11% | 142.13 47.40  -4% | 136.39 29.95  -8%
> 95   174.35 57.89 | 148.59 38.03 -15% | 156.85 43.64 -10% | 132.92 32.35 -24% | 126.44 28.24 -27% | 146.82 60.04 -16%
> 
> And the results on gigabyte-96c:
> 
> Kernel:              k0-stock-6.2.0-rc6 | k1-rmacq | k2-readonce | k3-alongor | k4-rmacq+readonce | k5-rmacq+alongor
> Statistic (kops/s):  mean std | mean std rel | mean std rel | mean std rel | mean std rel | mean std rel
> nwriters_stress:
>  1   713.72 25.68 | 707.32  17.73   -1% | 718.81 12.63  +1% | 712.80 13.57  -0% | 709.17 14.10  -1% | 730.33   9.14  +2%
>  2   376.25  8.19 | 400.09  16.24   +6% | 396.71 26.09  +5% | 412.61 17.80 +10% | 396.48  7.02  +5% | 409.90  14.61  +9%
>  3   415.07 16.83 | 410.19  19.82   -1% | 423.39  9.68  +2% | 417.28 10.23  +1% | 424.94 17.48  +2% | 422.92  11.75  +2%
>  4   286.77 26.63 | 285.13   6.80   -1% | 297.33 23.62  +4% | 296.49 16.60  +3% | 303.99 30.38  +6% | 296.93   9.90  +4%
>  8   296.56 20.45 | 308.97  12.53   +4% | 305.49 19.91  +3% | 294.24 17.24  -1% | 294.71 24.03  -1% | 294.09  25.20  -1%
> 16   257.34 33.94 | 266.03  29.60   +3% | 270.72 35.22  +5% | 252.28 50.16  -2% | 263.83 45.84  +3% | 247.42  41.01  -4%
> 32   278.78 51.45 | 215.35  68.40  -23% | 259.77 87.44  -7% | 217.26 79.67 -22% | 201.23 70.46 -28% | 282.47 116.65  +1%
> 64    75.82 64.87 | 194.52 137.19 +157% |  35.57 12.14 -53% |  74.24 72.04  -2% |  71.29 45.55  -6% |  77.93  43.57  +3%
> 95    60.37 68.13 | 198.38 116.93 +229% |  43.12 17.60 -29% |  58.80 36.47  -3% |  57.78 63.00  -4% |  61.33  71.18  +2%
> 
> We can safely conclude that the patches do not significantly affect
> the throughput of the lock_torture benchmark for rtmutex_lock.
> The values for nwriters_stress >= 64 can safely be ignored, as they
> are too widely spread.

Just so you know, locktorture is intended to be a stress test rather
than a performance benchmark.  Hugo Guiroux's dissertation gives a much
better locking performance methodology:

https://hugoguiroux.github.io/assets/these.pdf

							Thanx, Paul

> Please note that I pushed a landing page [6] with the results in HTML,
> which may be more convenient to browse, together with interactive
> charts.
> 
> Cheers,
> 
> Antonio
> 
> [1] https://e.huawei.com/uk/products/servers/taishan-server/taishan-2280-v2
> [2] https://en.wikichip.org/wiki/hisilicon/kunpeng/920-4826
> [3] https://www.gigabyte.com/Rack-Server/R182-Z91-rev-100
> [4] https://www.amd.com/en/products/cpu/amd-epyc-7352
> [5] https://lkml.org/lkml/2023/1/22/160
> [6] https://antonio.paolillo.be/public/rtlocks-locktorture-patches.html
> 