On 2021/4/29 5:28 PM, Hao Xu wrote:
On 4/29/21 5:37 AM, Hao Xu wrote:
On 2021/4/28 10:37 PM, Hao Xu wrote:
On 4/28/21 3:34 PM, Pavel Begunkov wrote:
On 4/28/21 2:32 PM, Hao Xu wrote:
When the sqthread is leveraged, sqes are submitted by it, which means there
is extra IO latency while the sqthread is being woken up. To hide it, submit
a limited number of sqes in the original task context.
Test results below:
Frankly, it can be a nest of corner cases if not now then in the future,
leading to a high maintenance burden. Hence, if we consider the change,
I'd rather want to limit the userspace exposure, so it can be removed
if needed.
A noticeable change of behaviour here, as Hao recently asked, is that
the ring can be passed to a task from a completely different thread group,
and so the feature would execute from that context, not from the
original/sqpoll one.
So maybe something like:
if (same_thread_group()) {
/* submit */
}
I thought about this case (across independent processes) for some time, Pavel,
could you give more hints about how this may trigger errors?
Currently? We need to audit cancellation, but don't think it's a problem.
But as said, it's about the future. Your patch adds a new quirky
userspace behaviour (submitting from alien context as described), and
once committed it can't be removed and must be maintained.
I can easily imagine it either over-complicating cancellations (and
we've had enough trouble with it), or just preventing more important
optimisations, or anything else often happening with new features.
Not sure IORING_ENTER_SQ_DEPUTY knob is needed, but at least can be
ignored if the previous point is addressed.
I'd question whether it'd be better with the flag, or without it, doing
this feature by default.
Just like what Jens said, the flag here is to allow users to make their own
decision. There may be cases where an application wants to offload as much
IO-related work as possible to sqpoll, so that it can be dedicated to
computation work etc.
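For illustration, here is a rough userspace sketch of how an application
might opt in, assuming the proposed IORING_ENTER_SQ_DEPUTY flag from this
patch ends up in the uapi header; enter_with_deputy() is a made-up helper,
not a liburing API:

    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/io_uring.h>

    /* Hypothetical helper: ask the kernel to submit a few sqes in this
     * task's context ("deputy" submit) instead of only waking the sqpoll
     * thread. Assumes IORING_ENTER_SQ_DEPUTY is defined by the patched
     * header; pass the usual flags (e.g. IORING_ENTER_SQ_WAKEUP) in
     * enter_flags as before. */
    static int enter_with_deputy(int ring_fd, unsigned int to_submit,
                                 unsigned int enter_flags)
    {
            enter_flags |= IORING_ENTER_SQ_DEPUTY;
            return syscall(__NR_io_uring_enter, ring_fd, to_submit, 0,
                           enter_flags, NULL, 0);
    }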
99th latency:
iops\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us
with this patch:
2k 13 13 12 13 13 12 12 11 11 10.304 11.84
without this patch:
2k 15 14 15 15 15 14 15 14 14 13 11.84
Not sure the second nine describes it well enough; can you please
add more data? Mean latency, 50%, 90%, 99%, 99.9%, throughput.
Sure, I will.
Forgot, but it's important: it should be compared with non-sqpoll as well,
because the feature is taking the middle ground between them.
Hi Pavel,
I found that the previous tests were not correct: when I ran them I had
IOPOLL enabled, and in io_uring_enter() I didn't add an 'if':

    if (IOPOLL disabled)
            submit at most 8 sqes in the original context

so this optimization cannot run under SQPOLL+IOPOLL (see the sketch below).
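For clarity, a minimal sketch of that check as it might sit in
io_uring_enter(); this is my paraphrase of the intended logic, not the exact
patch code, and it assumes the proposed IORING_ENTER_SQ_DEPUTY flag:

    /* Sketch only: deputy submission from the original task context.
     * Skipped when IOPOLL is enabled, since SQPOLL+IOPOLL reaps
     * completions from the sq thread itself. */
    if ((ctx->flags & IORING_SETUP_SQPOLL) &&
        !(ctx->flags & IORING_SETUP_IOPOLL) &&
        (flags & IORING_ENTER_SQ_DEPUTY)) {
            mutex_lock(&ctx->uring_lock);
            submitted = io_submit_sqes(ctx, min(to_submit, 8U));
            mutex_unlock(&ctx->uring_lock);
    }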
So I did another round of tests with my new code, at 2k IOPS, against an
Intel Optane device. The fio command is:
echo "
[global]
ioengine=io_uring
sqthread_poll=1
thread=1
bs=4k
direct=1
rw=randread
time_based=1
runtime=160
group_reporting=1
filename=/dev/nvme1n1
sqthread_poll_cpu=30
randrepeat=0
[job0]
cpus_allowed=35
iodepth=128
rate_iops=${1}
io_sq_thread_idle=${2}" | ./fio/fio -
the result is:
(Notes:
1. the big numbers in the percentile ('th') rows are in ns, the others are in us;
2. for the non-sqpoll mode, sq_thread_idle doesn't affect the metrics, so
those columns are just repeated runs.
)
2k+idle_submit_min8
metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us
min_lat 9 9 9 9 9 9 9 9 9 9 8
max_lat 2233 2802 3000 3518 3078 2052 2583 2914 4176 2255 3275
avg_lat 11.26 11.25 11.24 11.29 11.29 11.21 11.24 11.21 11.23 11.27 9.53
50th 7648 7648 7648 7648 7648 7648 7648 7584 7584 7648 9152
90th 7840 7840 7840 7840 7840 7840 7840 7776 7840 7904 9408
99th 10944 11200 11712 11328 12096 12352 11584 12096 11712 11072 12992
99.99th 73216 38144 56576 38656 38144 70144 40192 38144 40704 38144 39680
fio(usr) 3.66% 3.53% 3.80% 3.60% 3.41% 3.93% 3.55% 3.57% 3.58% 3.75% 3.82%
fio(sys) 0.77% 0.70% 0.64% 0.65% 0.82% 0.71% 0.69% 0.69% 0.68% 0.72% 0.07%
fio(usr+sys) 4.43% 4.23% 4.44% 4.25% 4.23% 4.64% 4.24% 4.26% 4.26% 4.47% 3.89%
2k+idle_without_submit_min8
metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us
min_lat 11 8 8 8 8 8 8 8 8 8 8
max_lat 12643 11703 11617 3050 8662 3402 5179 4307 2897 4204 3017
avg_lat 12.75 12.77 12.5 12.43 12.5 12.39 12.39 12.33 12.16 12.1 9.44
50th 11 11 11 11 11 11 11 11 11 11 9024
90th 11 11 11 11 11 11 11 11 11 11 9280
99th 18 17 15 16 16 16 16 16 15 15 13120
99.99th 930 1090 635 635 693 652 717 635 510 330 39680
fio(usr) 4.59% 3.29% 4.06% 4.10% 4.18% 4.30% 4.73% 4.12% 4.11% 3.83% 3.80%
fio(sys) 0.39% 0.45% 0.46% 0.42% 0.34% 0.39% 0.37% 0.37% 0.33% 0.39% 0.07%
fio(usr+sys) 4.98% 3.74% 4.52% 4.52% 4.52% 4.69% 5.10% 4.49% 4.44% 4.22% 3.87%
2k+without_sqpoll
metrics\idle 10us 60us 110us 160us 210us 260us 310us 360us 410us 460us 510us
min_lat 8 8 8 7 8 8 8 8 8 8 8
max_lat 623 602 273 388 271 290 891 495 519 396 245
avg_lat 13.12 13.14 13.15 13.1 13.16 13.06 13.06 13.15 13.12 13.14 13.1
50th 10560 10560 10560 10560 10560 10432 10560 10560 10560 10560 10560
90th 10816 10816 10944 10816 10816 10816 10816 10944 10816 10944 10816
99th 14144 14656 14272 13888 14400 14016 14400 14272 14400 14528 14272
99.99th 42752 50432 43246 41216 56576 42240 43264 42240 47360 58624 42240
fio(usr) 2.31% 2.33% 2.81% 2.30% 2.34% 2.34% 2.12% 2.26% 2.09% 2.36% 2.34%
fio(sys) 0.82% 0.79% 0.93% 0.81% 0.79% 0.77% 0.79% 0.86% 1.02% 0.76% 0.79%
fio(usr+sys) 3.13% 3.12% 3.74% 3.11% 3.13% 3.11% 2.91% 3.12% 3.11% 3.12% 3.13%
CPU usage of the sqthread hasn't been recorded, but I think the trend would
be like the results in the previous email.
From the data, the 'submit in original context' change reduces the average
latency and the percentile latencies, but it costs too much CPU in the
original context. Maybe 'submit min(to_submit, 8)' is too much at 2k IOPS;
I'll try other combinations of IOPS and idle time.
Thanks,
Hao
Btw, how did it happen that only some of the numbers have a fractional part?
Can't believe all but 3 of them were close enough to integer values.
This confused me a little bit too, but it is indeed what fio outputs.
It's just that whenever I see something like that, it tells me the data has
been manipulated. Even if it's fio, it's really weird and suspicious, and
worth looking into what's wrong with it.
Could you tell me what it is?
fio config:
./run_fio.sh
fio \
--ioengine=io_uring --sqthread_poll=1 --hipri=1 --thread=1 --bs=4k \
--direct=1 --rw=randread --time_based=1 --runtime=300 \
--group_reporting=1 --filename=/dev/nvme1n1 --sqthread_poll_cpu=30 \
--randrepeat=0 --cpus_allowed=35 --iodepth=128 --rate_iops=${1} \
--io_sq_thread_idle=${2}
Signed-off-by: Hao Xu <haoxu@xxxxxxxxxxxxxxxxx>
---
fs/io_uring.c | 29 +++++++++++++++++++++++------
include/uapi/linux/io_uring.h | 1 +
2 files changed, 24 insertions(+), 6 deletions(-)
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1871fad48412..f0a01232671e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1252,7 +1252,12 @@ static void io_queue_async_work(struct io_kiocb *req)
{
struct io_ring_ctx *ctx = req->ctx;
struct io_kiocb *link = io_prep_linked_timeout(req);
- struct io_uring_task *tctx = req->task->io_uring;
+ struct io_uring_task *tctx = NULL;
+
+ if (ctx->sq_data && ctx->sq_data->thread)
+ tctx = ctx->sq_data->thread->io_uring;
Without park it's racy: sq_data->thread may become NULL and be freed, as
well as its ->io_uring (a sketch of a parked lookup follows after the diff).
+ else
+ tctx = req->task->io_uring;
BUG_ON(!tctx);
BUG_ON(!tctx->io_wq);
[snip]
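To make that direction concrete, here is a minimal, untested sketch of a
parked lookup; io_deputy_tctx() is a made-up helper, and it assumes the
caller holds io_sq_thread_park(sqd) around both the lookup and the io_wq
enqueue, so the thread (and its ->io_uring) cannot go away underneath us:

    /* Sketch only: resolve the tctx to queue async work on. The caller
     * must have parked the sq thread (io_sq_thread_park(sqd)) for the
     * whole lookup + enqueue when sqd is non-NULL. */
    static struct io_uring_task *io_deputy_tctx(struct io_ring_ctx *ctx,
                                                struct io_kiocb *req)
    {
            struct io_sq_data *sqd = ctx->sq_data;

            if (sqd && sqd->thread)
                    return sqd->thread->io_uring;
            return req->task->io_uring;
    }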