Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> writes:

> On Tue, 05 Jul 2011 16:35:21 -0400
> Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
>
>> Hi,
>>
>> When looking at perf data for my aio-stress test runs against some
>> relatively fast storage, I noticed that __aio_get_req was taking up more
>> CPU than I expected.  The command line for aio-stress was:
>>
>>   aio-stress -O -r 8 -d 256 -b 32 -l -s 16384 /dev/sdk
>>
>> which means we're sending 32 iocbs to io_submit at a time.  Here's the
>> perf report -g output:
>>
>> Events: 41K cycles
>> -   7.33%  aio-stress  [kernel.kallsyms]  [k] _raw_spin_lock_irq
>>    - _raw_spin_lock_irq
>>       + 75.38% scsi_request_fn
>>       - 10.67% __aio_get_req
>>            io_submit_one
>>
>> The spinlock is the ctx_lock, taken when allocating the request.  This
>> seemed like pretty low-hanging fruit to me, so I cooked up a patch to
>> allocate 32 kiocbs at a time.  This is the result:
>>
>> Events: 50K cycles
>> -   5.54%  aio-stress  [kernel.kallsyms]  [k] _raw_spin_lock_irq
>>    - _raw_spin_lock_irq
>>       + 83.23% scsi_request_fn
>>       - 5.14% io_submit_one
>>
>> So, aio-stress now takes up ~2% less CPU.  This seems like a worthwhile
>> thing to do, but I don't have a system handy that would show any real
>> performance gains from it.  The higher the number of iocbs passed to
>> io_submit, the bigger the win this will be.
>>
>> FWIW, I tested this with multiple different batch sizes (1, 32, 33, 64,
>> 256) and it didn't blow up.
>
> So the benefit comes from reducing the frequency with which ctx_lock is
> taken, rather than from reducing its per-acquisition hold times.

That's right.  The lock is taken in both the submission and completion
paths.

>> ...
>>
>> +#define KIOCB_STASH_SIZE	32
>> +struct kiocb_stash {
>> +	unsigned head;
>> +	unsigned tail;
>> +	unsigned total;
>> +	unsigned used;
>> +	struct kiocb *kiocbs[KIOCB_STASH_SIZE];
>> +};
>
> Laziness begets laziness: you didn't bother documenting how this ring
> thing works and why it exists, so I didn't bother reverse-engineering
> that information.

Sorry.  I used a ring because I didn't want to introduce more allocations
in the submission path (though I have no real data on which to base this
premature optimization); a simplified sketch of the batching pattern
appears below.  As for documenting how it works, I thought it was pretty
straightforward.

> Once all this has been fixed, I might be wondering why it was done with
> a yukky fixed-size array rather than with a list.

I could change it to a list if that's preferred.  Remember, this was an
RFC.

> A pox on you for telling us there's a "potential starvation case" but
> not telling us what it is!  It's a kmalloc failure - why is this
> described as "starvation"?

Perhaps you missed that I simply moved that comment in the code; I didn't
add it.  When allocating an I/O context, the user supplies a maximum
number of events.  That number is used to size the ring buffer that holds
completion data.  It is possible that a completed request, still waiting
for its final fput, is keeping a new allocation from succeeding.  Calling
aio_fput_routine can free that up, allowing the allocation to succeed.

> Once all this has been fixed, I might be wondering how we test the
> "potential starvation case".

Probably by closing the fd prior to event completion, coupled with
submitting new I/O (on a different fd) after the completion event is
delivered to userspace but before the fput on the fd.
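To make the batching idea concrete, here is a minimal, self-contained
userspace sketch of the pattern being discussed; it is not the kernel
patch.  The type and function names (struct request, struct req_stash,
stash_refill(), stash_get()) are invented for illustration, and a pthread
mutex stands in for ctx_lock.  The point is simply that the lock is taken
once per refill of up to STASH_SIZE requests rather than once per request:

	#include <pthread.h>
	#include <stdlib.h>

	#define STASH_SIZE 32

	struct request {			/* stand-in for struct kiocb */
		int id;
	};

	struct req_stash {
		unsigned head;			/* next entry to hand out */
		unsigned total;			/* entries currently stashed */
		struct request *reqs[STASH_SIZE];
	};

	static pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;

	/*
	 * Allocate up to 'want' requests, then do the per-request
	 * bookkeeping for the whole batch under a single lock
	 * round-trip.  The bookkeeping here is only a placeholder for
	 * whatever the real code needs ctx_lock for.
	 */
	static void stash_refill(struct req_stash *s, unsigned want)
	{
		unsigned i, n = 0;

		if (want > STASH_SIZE)
			want = STASH_SIZE;

		/* The allocation itself does not need the lock. */
		while (n < want) {
			s->reqs[n] = calloc(1, sizeof(struct request));
			if (!s->reqs[n])
				break;
			n++;
		}

		pthread_mutex_lock(&ctx_lock);
		for (i = 0; i < n; i++)
			s->reqs[i]->id = (int)i;	/* placeholder bookkeeping */
		pthread_mutex_unlock(&ctx_lock);

		s->head = 0;
		s->total = n;
	}

	/* Hand out one request; NULL means the stash needs a refill. */
	static struct request *stash_get(struct req_stash *s)
	{
		if (s->head == s->total)
			return NULL;
		return s->reqs[s->head++];
	}

A submission loop would call stash_get() for each iocb and fall back to
stash_refill() only when the stash runs dry, so with 32 iocbs per
io_submit the lock round-trips for request allocation drop roughly
32-fold.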
> Various trivial fixies:
>
> From: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>
> - coding-style fixes
>
> - improve variable name(s)
>
> - fix typo(s)
>
> - use new kmap_atomic() interface
>
> - change comment to use the number of columns God gave us
>
> - use "bool" for a boolean
>
> Cc: Jeff Moyer <jmoyer@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>

This was an RFC.  I'm happy to have it merged, but I would have been
equally happy to address your review comments first.  In the end,
whatever works best for you is fine with me.

Thanks for taking the time to look at it!

Cheers,
Jeff