On 3/17/25 10:03, King, Colin wrote:
> Hi Christian,
>
> Follow-up below:
>
>> -----Original Message-----
>> From: Christian Loehle <christian.loehle@xxxxxxx>
>> Sent: 03 March 2025 22:25
>> To: King, Colin <colin.king@xxxxxxxxx>; Jens Axboe <axboe@xxxxxxxxx>;
>> Rafael J. Wysocki <rafael@xxxxxxxxxx>; Daniel Lezcano
>> <daniel.lezcano@xxxxxxxxxx>; linux-block@xxxxxxxxxxxxxxx;
>> linux-pm@xxxxxxxxxxxxxxx
>> Cc: linux-kernel@xxxxxxxxxxxxxxx
>> Subject: Re: [PATCH] cpuidle: psd: add power sleep demotion prevention
>> for fast I/O devices
>>
>> On 3/3/25 16:43, Colin Ian King wrote:
>>> Modern processors can drop into deep sleep states relatively quickly
>>> to save power. However, coming out of deep sleep states takes a small
>>> amount of time and this is detrimental to performance for I/O devices
>>> such as fast PCIe NVME drives when servicing completed I/O
>>> transactions.
>>>
>>> Testing with fio on read/write RAID0 PCIe NVME devices on various
>>> modern SMP based systems (such as a 96 thread Granite Rapids Xeon
>>> 6741P) has shown that 85-90% of read/write transactions issued on a
>>> CPU are completed by the same CPU, so it makes some sense to prevent
>>> the CPU from dropping into a deep sleep state to help reduce I/O
>>> handling latency.
>>
>> For the platform you tested on that may be true, but even if we
>> constrain ourselves to pci-nvme there's a variety of queue/irq mappings
>> where this doesn't hold, I'm afraid.
>
> This code is optional, one can enable it or disable it via the config
> option. Also, even when it is built in, one can disable it by writing 0
> to the sysfs file
> /sys/devices/system/cpu/cpuidle/psd_cpu_lat_timeout_ms
>
>>
>>> This commit introduces a simple, lightweight and fast power sleep
>>> demotion mechanism that provides the block layer a way to inform the
>>> menu governor to prevent a CPU from going into a deep sleep when an
>>> I/O operation is requested. While it is true that some I/Os may not
>>
>> s/requested/completed is the full truth, isn't it?
>>
>>> be serviced on the same CPU that issued the I/O request, and hence it
>>> is not 100% perfect, the mechanism does work well for the vast
>>> majority of I/O operations and there is very small overhead with the
>>> sleep demotion prevention.
>>>
>>> Test results on a 96 thread Xeon 6741P with a 6 way RAID0 PCIe NVME
>>> md array using fio 3.35 performing random read and read-write tests
>>> on a 512GB file with 8 concurrent I/O jobs. Tested with the
>>> NHM_C1_AUTO_DEMOTE bit in MSR_PKG_CST_CONFIG_CONTROL set in the BIOS.
>>>
>>> Test case: random reads, results based on geometric mean of results
>>> from 5 test runs:
>>>            Bandwidth        IO-ops   Latency   Bandwidth
>>>            read (bytes/sec) per sec  (ns)      % Std.Deviation
>>> Baseline:  21365755610      20377    390105    1.86%
>>> Patched:   25950107558      24748    322905    0.16%
>>
>> What is the baseline?
>> Do you mind trying with Rafael's recently posted series?
>> Given the IOPS I'd expect good results from that alone already.
>> https://lore.kernel.org/lkml/1916668.tdWV9SEqCh@xxxxxxxxxxxxx/
>>
>> (Happy to see teo as comparison too, which you don't modify).
>
> OK, I re-ran the tests on 6.14-rc5 on the same H/W. The "Baseline" is
> 6.14-rc5 without Rafael's patch. I also tested the Baseline with C6 and
> C6P disabled, as this stops deeper C-state sleeps and in theory should
> provide "best performance". I also benchmarked with Rafael's patch
> alone and my patch alone, and finally Rafael's patch AND my patch:
>
> Reads
>                     Bandwidth    Bandwidth  latency   latency
>                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> Baseline            25689182477  0.15       326177    0.15
> C6, C6P disabled    25839580014  0.19       324349    0.19
> Rafael's patch:     25695523747  0.06       326150    0.06
> My patch:           25782011833  0.07       324999    0.07
> Rafael + my patch:  25792551514  0.10       324924    0.10

So these are mostly equal, right?

>
> Writes
>                     Bandwidth    Bandwidth  latency   latency
>                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> Baseline            15220468898  3.33       550290    3.36
> C6, C6P disabled    13405624707  0.66       625424    0.66
> Rafael's patch:     14017625200  1.15       598049    1.16
> My patch:           15444417488  3.73       467818    29.10
> Rafael + my patch:  14037711032  1.13       597143    1.13

These don't make sense to me. Why would disabling C6 / C6P be this bad?
Why would Rafael's patch make it worse? Are these just noise?

>
> Combined Read+Writes, Reads
>
>                     Bandwidth    Bandwidth  latency   latency
>                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> Baseline            10132229433  0.41       484919    0.25
> C6, C6P disabled    10567536346  0.60       515260    1.16
> Rafael's patch:     10171044817  0.37       486937    0.20
> My patch:           10468953527  0.07       504797    0.07
> Rafael + my patch:  10174707546  1.26       488263    1.13
>
> Combined Read+Writes, Writes
>
>                     Bandwidth    Bandwidth  latency   latency
>                     Bytes/sec    %Std.Dev.  nanosecs  %Std.Dev.
> Baseline            10139393169  0.44       342132    1.23
> C6, C6P disabled    10583264662  0.63       277052    3.87
> Rafael's patch:     10178275035  0.39       336989    0.94
> My patch:           10482766569  1.28       294803    6.87
> Rafael + my patch:  10183837235  0.38       330657    3.39

The above two indicate roughly a +3% gain from (now mainline) Rafael's
patch to yours. There's no reason why Rafael + your patch should be
worse than just your patch; if this is statistically significant we
would need to look into why.

FWIW I think the energy cost of essentially remaining in polling even
for quite sporadic IO can't be understated IMO.

A comparison against teo, which might be slightly better than menu
while not outright disabling idle, would still be interesting.
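
For anyone skimming the thread, the mechanism being debated boils down
to a per-CPU deadline: the block layer stamps a short "stay shallow"
window whenever it handles I/O on a CPU, and the idle governor declines
deep C-states on that CPU until the window expires. Below is a minimal
userspace sketch of that idea only; every identifier in it
(psd_block_io_activity, psd_is_deep_sleep_blocked, psd_deadline_ns) is
hypothetical and not the actual patch API, and the timeout variable
merely mirrors the psd_cpu_lat_timeout_ms knob mentioned above, where 0
disables the behaviour.

/*
 * Illustrative sketch only -- not the patch code. Models a per-CPU
 * sleep-demotion deadline bumped on I/O activity and consulted by the
 * idle governor. All names are hypothetical.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NR_CPUS 96

/* Mirrors the psd_cpu_lat_timeout_ms sysfs knob; 0 disables the feature. */
static unsigned int psd_cpu_lat_timeout_ms = 3;

/* Per-CPU deadline (ns, monotonic) until which deep idle states are avoided. */
static uint64_t psd_deadline_ns[NR_CPUS];

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Hypothetical hook the I/O path would call on the submitting/completing CPU. */
static void psd_block_io_activity(int cpu)
{
	if (psd_cpu_lat_timeout_ms)
		psd_deadline_ns[cpu] = now_ns() +
			(uint64_t)psd_cpu_lat_timeout_ms * 1000000ull;
}

/* Hypothetical check the idle governor would make before picking a deep state. */
static bool psd_is_deep_sleep_blocked(int cpu)
{
	return psd_cpu_lat_timeout_ms && now_ns() < psd_deadline_ns[cpu];
}

int main(void)
{
	psd_block_io_activity(0);

	/* CPU 0 saw recent I/O, CPU 1 did not. */
	printf("cpu0 deep sleep blocked: %d\n", psd_is_deep_sleep_blocked(0));
	printf("cpu1 deep sleep blocked: %d\n", psd_is_deep_sleep_blocked(1));
	return 0;
}

Both the appeal and the criticism in the thread follow from how cheap
these two calls are: the governor-side check is a couple of loads and a
compare, but it only helps if the CPU whose deadline was stamped is also
the one that ends up servicing the completion, which is exactly the
queue/irq mapping caveat raised above.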