Re: [PATCH RFC 5.13 1/2] io_uring: add support for ns granularity of io_sq_thread_idle

Hao Xu <haoxu@xxxxxxxxxxxxxxxxx> · Wed, 29 Sep 2021 15:52:14 +0800

On 2021/9/28 6:51 PM, Pavel Begunkov wrote:
On 9/26/21 11:00 AM, Hao Xu wrote:
On 2021/4/30 6:15 AM, Pavel Begunkov wrote:
On 4/29/21 4:28 AM, Hao Xu wrote:
On 2021/4/28 10:07 PM, Pavel Begunkov wrote:
On 4/28/21 2:32 PM, Hao Xu wrote:
Currently the unit of io_sq_thread_idle is the millisecond, so the
smallest value is 1ms, which means that for IOPS > 1000 the sqthread
will very likely take 100% cpu. This is not necessary in some cases:
users may not care much about latency under low IO pressure
(like 1000 < IOPS < 20000), but cpu resource does matter. So we offer
an option of nanosecond granularity for io_sq_thread_idle. Some test
results by fio below:

If numbers justify it, I don't see why not do it in ns, but I'd suggest
to get rid of all the mess and simply convert to jiffies during ring
creation (i.e. nsecs_to_jiffies64()), and leave io_sq_thread() unchanged.
1) Here I keep millisecond mode for compatibility.
2) jiffies are derived from HZ, and even with a large HZ
(like HZ = 1000) nsecs_to_jiffies64() can return 0:

   u64 nsecs_to_jiffies64(u64 n)
   {
   #if (NSEC_PER_SEC % HZ) == 0
           /* Common case, HZ = 100, 128, 200, 250, 256, 500, 512, 1000 etc. */
           return div_u64(n, NSEC_PER_SEC / HZ);
   #elif (HZ % 512) == 0
           /* overflow after 292 years if HZ = 1024 */
           return div_u64(n * HZ / 512, NSEC_PER_SEC / 512);
   #else
           /*
            * Generic case - optimized for cases where HZ is a multiple of 3.
            * overflow after 64.99 years, exact for HZ = 60, 72, 90, 120 etc.
            */
           return div_u64(n * 9, (9ull * NSEC_PER_SEC + HZ / 2) / HZ);
   #endif
   }

Say HZ = 1000, then nsecs_to_jiffies64(1us) = 1e3 / (1e9 / 1e3) = 0.
IOW, nsecs_to_jiffies64() returns 0 for any n < (1e9 / HZ).
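
The truncation above can be checked in userspace. This is a minimal
sketch, assuming only the "common case" branch (NSEC_PER_SEC % HZ == 0);
the model_* helpers are hypothetical names for illustration, not kernel
APIs, and HZ is passed as a parameter instead of being a build-time
constant:

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Userspace model of the common-case branch of nsecs_to_jiffies64().
 * Hypothetical helper: hz stands in for the kernel's compile-time HZ.
 */
static uint64_t model_nsecs_to_jiffies64(uint64_t nsecs, uint64_t hz)
{
        /* Only the (NSEC_PER_SEC % HZ) == 0 case is modelled here. */
        assert(hz > 0 && NSEC_PER_SEC % hz == 0);
        return nsecs / (NSEC_PER_SEC / hz);   /* integer division truncates */
}

/* Smallest interval, in ns, that does not round down to 0 jiffies. */
static uint64_t model_min_nonzero_nsecs(uint64_t hz)
{
        assert(hz > 0 && NSEC_PER_SEC % hz == 0);
        return NSEC_PER_SEC / hz;
}
```

With hz = 1000, model_nsecs_to_jiffies64(1000, 1000) is 0 and the first
non-zero result appears at 1000000 ns (1ms), which is the existing
millisecond granularity again.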

Agree, apparently jiffies precision is a fraction of a second, e.g. 0.001s.
But I'd much prefer not to duplicate all that. So, jiffies won't do;
ktime() may be ok but is a bit heavier than we'd like it to be...

Jens, any chance you remember something in the middle? Like same source
as ktime() but without the heavy correction it does.
I'm gonna pick this one up again. Currently this patch
with ktime_get_ns() works well in our production. This
patch makes the latency a bit higher than before, but
still lower than aio.
I haven't found a faster alternative to ktime_get_ns(),
any hints?

Good, I'd suggest looking through Documentation/core-api/timekeeping.rst
In particular, the coarse variants may be of interest.
https://www.kernel.org/doc/html/latest/core-api/timekeeping.html#coarse-and-fast-ns-access
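
For a rough feel of the trade-off, the timekeeping documentation
describes the coarse kernel accessors as corresponding to the *_COARSE
clocks in user space. The sketch below is a userspace analogue only,
assuming Linux and glibc; the helper names are hypothetical, and it says
nothing about what the patch itself should call:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for CLOCK_MONOTONIC_COARSE */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Convert a timespec to nanoseconds. */
static uint64_t ts_to_ns(struct timespec ts)
{
        assert(ts.tv_sec >= 0 && ts.tv_nsec >= 0);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Resolution of the given clock in ns, or 0 if it is unavailable. */
static uint64_t clock_resolution_ns(clockid_t id)
{
        struct timespec res;

        if (clock_getres(id, &res) != 0)
                return 0;
        return ts_to_ns(res);
}

/* Current reading of the given clock in ns, or 0 on error. */
static uint64_t clock_now_ns(clockid_t id)
{
        struct timespec now;

        if (clock_gettime(id, &now) != 0)
                return 0;
        return ts_to_ns(now);
}
```

Comparing clock_resolution_ns(CLOCK_MONOTONIC_COARSE) with
clock_resolution_ns(CLOCK_MONOTONIC) shows the cost of the cheaper read:
the coarse clock only advances once per tick, so an idle timeout shorter
than a tick cannot be measured with it.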

Off topic: it sounds like you're a long-time user of SQPOLL. I'm
interested in how you find it in general, i.e. does it help much with
latency? Performance? Anything else?
It helps with latency and iops (I can't recall the exact numbers now).
It is useful when many user threads offload IO to just one sqthread.