Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle

Hao Xu <haoxu@xxxxxxxxxxxxxxxxx> · Thu, 29 Apr 2021 11:28:04 +0800

On 2021/4/28 10:07 PM, Pavel Begunkov wrote:
On 4/28/21 2:32 PM, Hao Xu wrote:
Currently the unit of io_sq_thread_idle is the millisecond and the smallest
value is 1ms, which means that for IOPS > 1000 the sqthread will very likely
take 100% cpu. This is unnecessary in some cases: users may not care much
about latency under low IO pressure (like 1000 < IOPS < 20000), but cpu
resources do matter. So we offer an option for nanosecond granularity of
io_sq_thread_idle. Some fio test results are below:

If the numbers justify it, I don't see why not to do it in ns, but I'd suggest
getting rid of all the mess and simply converting to jiffies during ring
creation (i.e. nsecs_to_jiffies64()), leaving io_sq_thread() unchanged.
1) Here I keep the millisecond mode for compatibility.
2) Jiffies are derived from HZ, and even with a large HZ (like HZ = 1000)
a jiffy is 1ms, so nsecs_to_jiffies64() returns 0 for sub-millisecond values:

  u64 nsecs_to_jiffies64(u64 n)
  {
  #if (NSEC_PER_SEC % HZ) == 0
          /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */
          return div_u64(n, NSEC_PER_SEC / HZ);
  #elif (HZ % 512) == 0
          /* overflow after 292 years if HZ = 1024 */
          return div_u64(n * HZ / 512, NSEC_PER_SEC / 512);
  #else
          /*
           * Generic case - optimized for cases where HZ is a multiple of 3.
           * overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc.
           */
          return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ);
  #endif
  }

Say HZ = 1000: then nsecs_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0, because
the integer division truncates.
IOW, nsecs_to_jiffies64() returns 0 for any n < (1e9 / HZ).
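
The truncation can be reproduced in user space. This is only a minimal sketch, assuming HZ = 1000 and mirroring just the common-case branch of the function quoted above; nsecs_to_jiffies64_sketch() and nsecs_to_jiffies64_roundup() are hypothetical names for illustration, not kernel functions.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL
#define HZ 1000ULL /* assumed tick rate for this sketch */

/* User-space stand-in for the (NSEC_PER_SEC % HZ) == 0 branch. */
static uint64_t nsecs_to_jiffies64_sketch(uint64_t n)
{
	return n / (NSEC_PER_SEC / HZ);
}

/*
 * Hypothetical variant that rounds up instead of truncating, so a
 * non-zero idle time never becomes 0 jiffies.
 */
static uint64_t nsecs_to_jiffies64_roundup(uint64_t n)
{
	uint64_t nsec_per_jiffy = NSEC_PER_SEC / HZ;

	return (n + nsec_per_jiffy - 1) / nsec_per_jiffy;
}
```

With these definitions, 1us (1000ns) and 999999ns both convert to 0 jiffies, and 1ms is the first value that yields 1. Even the rounding variant cannot help: it turns 10us into one full jiffy, i.e. 1ms, so jiffies cannot express a sub-millisecond idle time at HZ = 1000.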


Or is there a reason for having it high precision, i.e. ktime()?

uring average latency (us):
iops\idle  10us   60us   110us  160us  210us  260us  310us  360us  410us  460us  510us
2k         10.93  10.68  10.72  10.7   10.79  10.52  10.59  10.54  10.47  10.39  8.4
4k         10.55  10.48  10.51  10.42  10.35  8.34
6k         10.82  10.5   10.39  8.4
8k         10.44  10.45  10.34  8.39
10k        10.45  10.39  8.33

uring cpu usage of sqthread:
iops\idle  10us    60us    110us   160us   210us   260us   310us   360us   410us   460us   510us
2k         4%      14%     24%     34.70%  44.70%  55%     65.10%  75.40%  85.40%  95.70%  100%
4k         7.70%   28.20%  48.50%  69%     90%     100%
6k         11.30%  42%     73%     100%
8k         15.30%  56.30%  97%     100%
10k        19%     70%     100%

aio average latency (us):
iops  latency  99th lat  cpu
2k    13.34    14.272    3%
4k    13.195   14.016    7%
6k    13.29    14.656    9.70%
8k    13.2     14.656    12.70%
10k   13.2     15        17%

fio config is:
./run_fio.sh
fio \
--ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \
--direct=1 --rw=randread --time_based=1 --runtime=300 \
--group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \
--randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \
--io_sq_thread_idle=${2}

At 2k IOPS, if a latency of 10.93us is acceptable for an application,
it gets a 100% - 4% = 96% reduction in sqthread cpu usage, while the latency
is still lower than with aio (10.93us vs 13.34us).
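
The cpu figures in the table above look roughly consistent with a simple back-of-the-envelope model, sketched here. The model is an assumption and is not stated in the patch: each request keeps the sqthread busy for about per_io_us, after which it spins for idle_us before sleeping. sqthread_cpu_pct() is a hypothetical helper, and per_io_us = 10 is a guess taken from the ~10.5us latencies measured above.

```c
#include <stdint.h>

/*
 * Rough estimate of sqthread cpu usage in percent.
 * Assumed model: per request, the thread is busy for per_io_us and then
 * spins for idle_us; usage saturates at 100% once the thread never sleeps.
 */
static double sqthread_cpu_pct(uint64_t iops, uint64_t idle_us,
			       uint64_t per_io_us)
{
	double busy_us_per_sec = (double)iops * (double)(idle_us + per_io_us);
	double pct = busy_us_per_sec / 1e6 * 100.0;

	return pct > 100.0 ? 100.0 : pct;
}
```

For example, sqthread_cpu_pct(2000, 10, 10) gives about 4% (measured: 4%) and sqthread_cpu_pct(8000, 60, 10) gives about 56% (measured: 56.30%). Under this model the thread stops sleeping once idle_us + per_io_us exceeds the gap between requests, which is where the measured rows reach 100%.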

Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
---
[snip]

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e1ae46683301..311532ff6ce3 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -98,6 +98,7 @@ enum {
  #define IORING_SETUP_CLAMP	(1U << 4)	/* clamp SQ/CQ ring sizes */
  #define IORING_SETUP_ATTACH_WQ	(1U << 5)	/* attach to existing wq */
  #define IORING_SETUP_R_DISABLED	(1U << 6)	/* start with ring disabled */
+#define IORING_SETUP_IDLE_NS	(1U << 7)	/* unit of thread_idle is nanosecond */
  
  enum {
  	IORING_OP_NOP,
@@ -259,7 +260,7 @@ struct io_uring_params {
  	__u32 cq_entries;
  	__u32 flags;
  	__u32 sq_thread_cpu;
-	__u32 sq_thread_idle;
+	__u64 sq_thread_idle;

breaks userspace API

  	__u32 features;
  	__u32 wq_fd;
  	__u32 resv[3];
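
The "breaks userspace API" remark on the sq_thread_idle hunk can be illustrated by comparing member offsets before and after the change. This is a sketch only: the two struct names are hypothetical, fixed-width stdint types stand in for __u32/__u64, and the members that follow resv[3] in the real struct io_uring_params are omitted.

```c
#include <stddef.h>
#include <stdint.h>

/* Trimmed copy of the existing layout (trailing members omitted). */
struct params_u32_idle {
	uint32_t sq_entries;
	uint32_t cq_entries;
	uint32_t flags;
	uint32_t sq_thread_cpu;
	uint32_t sq_thread_idle;
	uint32_t features;
	uint32_t wq_fd;
	uint32_t resv[3];
};

/* Same members with sq_thread_idle widened to 64 bits, as in the hunk. */
struct params_u64_idle {
	uint32_t sq_entries;
	uint32_t cq_entries;
	uint32_t flags;
	uint32_t sq_thread_cpu;
	uint64_t sq_thread_idle;
	uint32_t features;
	uint32_t wq_fd;
	uint32_t resv[3];
};
```

In the first layout, features sits at offset 20 and the struct is 40 bytes. In the second, features moves to offset 24 and the struct grows. A binary built against the old header would therefore read features, wq_fd and everything after them from the wrong place when running on a kernel with the new layout, and vice versa.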