Re: [RFC 0/2] 3 cacheline io_kiocb

Jens Axboe <axboe@xxxxxxxxx> · Sat, 25 Jul 2020 14:25:49 -0600

On 7/25/20 2:14 PM, Pavel Begunkov wrote:
>  On 25/07/2020 22:40, Jens Axboe wrote:
>> On 7/25/20 12:24 PM, Pavel Begunkov wrote:
>>> On 25/07/2020 18:45, Jens Axboe wrote:
>>>> On 7/25/20 2:31 AM, Pavel Begunkov wrote:
>>>>> That's not final for several reasons, but it's good enough for discussion.
>>>>> That brings io_kiocb down to 192B. I didn't try to benchmark it
>>>>> properly, but a quick nop test gave a ~5% throughput increase:
>>>>> 7531 vs 7910 KIOPS with fio/t/io_uring.
>>>>>
>>>>> The whole situation is obviously a bunch of tradeoffs. For instance,
>>>>> instead of shrinking it, we could inline apoll to speed up the apoll path.
>>>>>
>>>>> [2/2] is just for reference; I'm thinking about other ways to shrink it,
>>>>> e.g. ->link_list can be a singly-linked list with linked timeouts
>>>>> storing a back-reference. This may turn out to be better, because
>>>>> that would move ->fixed_file_refs to the 2nd cacheline, so we would
>>>>> never touch the 3rd cacheline in the submission path.
>>>>> Any other ideas?
>>>>
>>>> Nothing noticeable for me, still about the same performance. But
>>>> generally speaking, I don't necessarily think we need to go all in on
>>>> making this as tiny as possible. It's much more important to chase the
>>>> items where we only use 2 cachelines for the hot path, and then we have
>>>> the extra space in there already for the semi hot paths like poll driven
>>>> retry. Yes, we're still allocating from a pool that has slightly larger
>>>> objects, but that doesn't really matter _that_ much. Avoiding an extra
>>>> kmalloc+kfree for the semi hot paths is a bigger deal than making
>>>> io_kiocb smaller and smaller.
>>>>
>>>> That said, for no-brainer changes, we absolutely should make it smaller.
>>>> I just don't want to jump through convoluted hoops to get there.
>>>
>>> Agree, but that's not the end goal. The first point is to kill the union,
>>> but it already has enough space for that.
>>
>> Right
>>
>>> The second is to see whether we can use the space in a better way. From
>>> a high-level perspective, ->apoll and ->work are alike and both serve to
>>> provide asynchronous paths, hence the idea of swapping them naturally
>>> comes to mind.
>>
>> Totally agree, which is why the union of those kind of makes sense.
>> We're definitely NOT using them at the same time, but the fact that we
>> had various mm/creds/whatnot in the work_struct made that a bit iffy.
> 
> Thinking about it, if combined with the work de-init you proposed before,
> it's probably possible to make a layout similar to the one below:
> 
> struct io_kiocb {
> 	...
> 	struct hlist_node	hash_node;
> 	struct callback_head	task_work;
> 	union {
> 		struct io_wq_work	work;
> 		struct async_poll	apoll;
> 	};
> };
> 
> That saves the ->apoll kmalloc(), and the actual work de-init would be
> negligibly rare. Worth trying.

And hopefully get rid of the stupid apoll->work and the copy back and
forth... But yes, this would be most excellent, and an ideal layout.
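
A minimal userspace sketch of the layout proposed above, assuming simplified stand-ins: the *_mock structs, their members, and the work_deinit()/arm_poll_inline() helpers are hypothetical illustrations, not the kernel's io_wq_work, async_poll or poll-arming code. It only demonstrates the two properties the idea relies on: ->work and ->apoll share storage, and whatever ->work holds is released before the union is reused for poll.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: members are placeholders for illustration,
 * NOT the real kernel definitions of these structures. */
struct hlist_node_mock { struct hlist_node_mock *next, **pprev; };
struct callback_head_mock {
	struct callback_head_mock *next;
	void (*func)(struct callback_head_mock *);
};
struct io_wq_work_mock {
	void *next;
	const void *creds;	/* stands in for references ->work may hold */
	unsigned int flags;
};
struct async_poll_mock {
	void *wait_head;
	void *double_poll;
	unsigned int events;
};

struct io_kiocb_mock {
	/* ... hot submission-path fields elided ... */
	struct hlist_node_mock hash_node;
	struct callback_head_mock task_work;
	/* A request is either punted to io-wq or armed for poll, not both
	 * at once, so the two async contexts can share storage. */
	union {
		struct io_wq_work_mock work;
		struct async_poll_mock apoll;
	};
};

/* Both members start at the same offset: the union costs max(), not sum(). */
_Static_assert(offsetof(struct io_kiocb_mock, work) ==
	       offsetof(struct io_kiocb_mock, apoll),
	       "work and apoll must overlap");

static int mock_creds_put;	/* counts simulated reference drops */

/* Hypothetical "work de-init": release what ->work holds before the
 * union is reused. Real code would drop mm/creds references here. */
static void work_deinit(struct io_kiocb_mock *req)
{
	if (req->work.creds) {
		mock_creds_put++;
		req->work.creds = NULL;
	}
}

/* Arm poll using the embedded ->apoll instead of a separate kmalloc(). */
static void arm_poll_inline(struct io_kiocb_mock *req, unsigned int mask)
{
	work_deinit(req);	/* the (rare) case where ->work was set up */
	memset(&req->apoll, 0, sizeof(req->apoll));
	req->apoll.events = mask;
}
```

Because a union is sized to its largest member, the embedded ->apoll adds nothing when it is no larger than ->work; the price is that any transition from the io-wq path to poll has to run the de-init first.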

>>> TBH, I don't think it'd do much, because init of ->io would probably
>>> hide any benefit.
>>
>> There should be no ->io init/alloc for this test case.
> 
> I mean, before getting into io_arm_poll_handler(), it should get -EAGAIN
> in io_{read,write}() and initialise ->io in io_setup_async_rw(), at least
> for READV, WRITEV.

Sure, but for my testing, there's never an EAGAIN, so I won't be
exercising that path for the peak testing.
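
The budget discussed in this thread (192B, i.e. 3 x 64B cachelines, with the submission path ideally touching only the first two) can be pinned down at compile time so layout drift breaks the build. A minimal sketch, assuming 64-byte cachelines, an LP64 target, and an object that starts on a cacheline boundary; struct req_mock, its fields and the CACHELINE_OF() macro are hypothetical, with the char arrays standing in for other members. For the real structure, `pahole -C io_kiocb` on a kernel built with debug info prints the actual layout with cacheline boundary markers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cachelines and 8-byte pointers (LP64). */
#define CACHELINE 64u

/* 0-based index of the cacheline a member starts in. */
#define CACHELINE_OF(type, member) (offsetof(type, member) / CACHELINE)

/* Hypothetical request layout; char arrays stand in for other members. */
struct req_mock {
	/* cacheline 0: touched by every submission */
	void *file;
	uint64_t user_data;
	uint32_t opcode;
	uint32_t flags;
	char hot_rest[40];
	/* cacheline 1: still on the submission path */
	void *link_next;
	void *fixed_file_refs;	/* kept out of the 3rd cacheline */
	int32_t result;
	char warm_rest[44];
	/* cacheline 2: only async/retry paths should reach this */
	void *async_ctx;
	char cold_rest[56];
};

/* Fail the build if the layout drifts past the budget. */
_Static_assert(sizeof(struct req_mock) <= 3 * CACHELINE,
	       "request must fit in 3 cachelines (192B)");
_Static_assert(CACHELINE_OF(struct req_mock, fixed_file_refs) < 2,
	       "submission path must not touch the 3rd cacheline");
_Static_assert(CACHELINE_OF(struct req_mock, async_ctx) == 2,
	       "cold fields belong in the 3rd cacheline");
```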

-- 
Jens Axboe