On 2/25/20 3:12 AM, Pavel Begunkov wrote:
> On 2/25/2020 6:13 AM, Jens Axboe wrote:
>>>> I still think flags tagged on sqes could be a better choice, which
>>>> gives users an ability to decide if they want to ignore the cqes, not
>>>> only for links, but also for normal sqes.
>>>>
>>>> In addition, boxed cqes couldn't resolve the issue of
>>>> IORING_IO_TIMEOUT.
>>>
>>> I would tend to agree, and it'd be trivial to just set the flag on
>>> whatever SQEs in the chain you don't care about. Or even an individual
>>> SQE, though that's probably a bit more of a reach in terms of use case.
>>> Maybe nop with drain + ignore?
>
> Flexible, but not performant. The existence of drain already makes
> io_uring do a lot of extra stuff, and it's even worse when drain is
> actually used.

Yeah I agree, and that's assuming we can make the drain more efficient.
Just hand waving on possible use cases :-)

>>> In any case it's definitely more flexible.
>
> That's a different thing. Knowing how requests behave (e.g. if
> nbytes != res, then fail the link), one would want to get a cqe for the
> last executed sqe, whether it's an error or a success.
>
> It makes a link be handled as a single entity. I don't see a way to
> emulate similar behaviour with unconditional masking. Probably, we
> will need them both.

But you can easily do that with IOSQE_NO_CQE, in fact that's what I did
to test this. The chain will have IOSQE_NO_CQE | IOSQE_IO_LINK set on
all but the last request.

>> In the interest of taking this to the extreme, I tried a nop benchmark
>> on my laptop (qemu/kvm). Granted, this setup is particularly sensitive
>> to spinlocks, they are a lot more expensive there than on a real host.
>>
>> Anyway, regular nops run at about 9.5M/sec with a single thread.
>> Flagging all SQEs with IOSQE_NO_CQE nets me about 14M/sec. So a handy
>> improvement.
>> Looking at the top of profiles:
>>
>> cqe-per-sqe:
>>
>> +   28.45%  io_uring  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
>> +   14.38%  io_uring  [kernel.kallsyms]  [k] io_submit_sqes
>> +    9.38%  io_uring  [kernel.kallsyms]  [k] io_put_req
>> +    7.25%  io_uring  libc-2.31.so       [.] syscall
>> +    6.12%  io_uring  [kernel.kallsyms]  [k] kmem_cache_free
>>
>> no-cqes:
>>
>> +   19.72%  io_uring  [kernel.kallsyms]  [k] io_put_req
>> +   11.93%  io_uring  [kernel.kallsyms]  [k] io_submit_sqes
>> +   10.14%  io_uring  [kernel.kallsyms]  [k] kmem_cache_free
>> +    9.55%  io_uring  libc-2.31.so       [.] syscall
>> +    7.48%  io_uring  [kernel.kallsyms]  [k] __io_queue_sqe
>>
>> I'll try the real disk IO tomorrow, using polled IO.
>
> Great, would love to see

My box with the optane2 is apparently out of commission, cannot get it
going today. So I had to make do with my laptop, which does about ~600K
random read IOPS. I don't see any difference there with polled IO, using
4-link-deep chains (so 1/4th the CQEs). Both run at around 611-613K
IOPS.

-- 
Jens Axboe