Re: [RFC] single cqe per link

On 25/02/2020 23:20, Jens Axboe wrote:
> On 2/25/20 3:12 AM, Pavel Begunkov wrote:
>> Flexible, but not performant. The existence of drain already makes
>> io_uring do a lot of extra stuff, and it's even worse when it's actually used.
> 
> Yeah I agree, that's assuming we can make the drain more efficient. Just
> hand waving on possible use cases :-)

I don't even know what to do with sequences and drains once we get to in-kernel
sqe generation. And the current linear numbering won't hold at all.

E.g. req1 -> DRAIN, where req1 indefinitely generates req2, req3, etc. Should
the generated requests be ordered before the DRAIN, or can they run at any
time? And what would the performance burden of that be?..
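For reference, today's drain semantics in a minimal liburing sketch (the ring,
fd and iovec setup are assumed):

    /* req1: a plain read */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);

    /* req2: not started until *every* previously submitted sqe
     * (req1 here) has completed. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_fsync(sqe, fd, 0);
    sqe->flags |= IOSQE_IO_DRAIN;

    io_uring_submit(&ring);

With in-kernel generation there is no submission order to pin the generated
requests to, which is exactly the problem.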

I'd rather forbid combining drain with some of the new features, and that's
the reason behind my question about how widely it's used.

>>
>> That's a different thing. Knowing how requests behave (e.g. if
>> nbytes != res, then fail the link), one would want to get a cqe for the
>> last executed sqe, whether that last one is an error or a success.
>>
>> It lets a link be handled as a single entity. I don't see a way to
>> emulate similar behaviour with unconditional masking. Probably we
>> will need them both.
> 
> But you can easily do that with IOSQE_NO_CQE, in fact that's what I did
> to test this. The chain will have IOSQE_NO_CQE | IOSQE_IO_LINK set on
> all but the last request.

That's fine if you don't expect the chain to fail. Otherwise, you only get
-ECANCELED for the last one, so you know neither the error code nor the
failed request's user_data. Forcing IOSQE_NO_CQE requests to still emit a
cqe in case of an error isn't really better.
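To make the failure mode concrete, here is a rough liburing-style sketch of
the emulation above, assuming the proposed IOSQE_NO_CQE flag from this series
(not in any released header, so treat the flag name as illustrative; ring, fd
and iov[] setup are assumed):

    /* Link three reads; suppress cqes for all but the tail. */
    struct io_uring_sqe *sqe;
    int i;

    for (i = 0; i < 3; i++) {
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_readv(sqe, fd, &iov[i], 1, 0);
        sqe->user_data = i;
        if (i < 2)
            sqe->flags |= IOSQE_IO_LINK | IOSQE_NO_CQE;
    }
    io_uring_submit(&ring);

    /* Success: a single cqe with user_data == 2 and res == nbytes.
     * Failure of req 0 or req 1: the only cqe is the tail's, with
     * res == -ECANCELED, so which read failed and why is lost. */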

I know it's hard to judge based on a performance-testing-only patch, but the
whole idea is to greatly simplify userspace cqe handling, including errors.
And I'd like to find something better/faster that does the same favor.
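Under the semantics I described above (a cqe posted for the last *executed*
sqe in the link), consuming completions would look roughly like this; the
two handle_link_* helpers are hypothetical:

    struct io_uring_cqe *cqe;

    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res < 0)
        /* user_data identifies the request that actually failed,
         * res carries its real error code. */
        handle_link_failure(cqe->user_data, cqe->res);
    else
        /* The whole chain succeeded; one cqe for the link tail. */
        handle_link_success(cqe->user_data, cqe->res);
    io_uring_cqe_seen(&ring, cqe);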


> 
> My box with the optane2 is apparently out of commission, I cannot get it
> going today. So I had to make do with my laptop, which does about 600K
> random read IOPS. I don't see any difference there, using polled IO and
> 4-deep link chains (so 1/4th the CQEs). Both run at around
> 611-613K IOPS.

-- 
Pavel Begunkov


