On 2012-12-15 10:25, Kent Overstreet wrote:
> On Fri, Dec 14, 2012 at 08:35:53AM +0100, Jens Axboe wrote:
>> On 2012-12-14 03:26, Jack Wang wrote:
>>> 2012/12/14 Jens Axboe <jaxboe@xxxxxxxxxxxx>:
>>>> On Mon, Dec 03 2012, Kent Overstreet wrote:
>>>>> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
>>>>>
>>>>> Changes since the last posting should all be noted in the individual
>>>>> patch descriptions.
>>>>>
>>>>> * Zach pointed out the aio_read_evt() patch was calling functions that
>>>>>   could sleep in TASK_INTERRUPTIBLE state; that patch is rewritten.
>>>>> * Ben pointed out some synchronize_rcu() usage was problematic;
>>>>>   converted it to call_rcu().
>>>>> * The flush_dcache_page() patch is new.
>>>>> * Changed the "use cancellation list lazily" patch so as to remove
>>>>>   ki_flags from struct kiocb.
>>>>
>>>> Kent, I ran a few tests, and the below patches still don't seem as fast
>>>> as the approach I took. To keep it fair, I used your aio branch and
>>>> applied my dio speedups too. As a sanity check, I ran with your branch
>>>> alone as well. The quick results are below - kaio is kent-aio, just
>>>> your branch; kaio-dio is with the direct IO speedups too; jaio is my
>>>> branch, which already has the dio changes.
>>>>
>>>> Devices  Branch    IOPS
>>>> 1        kaio      ~915K
>>>> 1        kaio-dio  ~930K
>>>> 1        jaio      ~1220K
>>>> 6        kaio      ~3050K
>>>> 6        kaio-dio  ~3080K
>>>> 6        jaio      ~3500K
>>>>
>>>> The box runs out of CPU driving power, which is why it doesn't scale
>>>> linearly; otherwise I know that jaio at least does. It's basically
>>>> completion-limited for the 6-device test at the moment.
>>>>
>>>> I'll run some profiling tomorrow morning and get you some better
>>>> results. Just thought I'd share these at least.
>>>>
>>>> --
>>>> Jens Axboe
>>>
>>> Really good performance, wow.
>>>
>>> I think the device tested is a really fast PCIe SSD built by Fusion-io,
>>> with Fusion-io's in-house block driver?
>>
>> It is PCIe flash storage, but it is not Fusion-io.
>>
>>> Any comparison numbers against current mainline?
>>
>> Sure, I should have included that. Here's the table again, this time
>> with mainline as well.
>>
>> Devices  Branch    IOPS
>> 1        mainline  ~870K
>> 1        kaio      ~915K
>> 1        kaio-dio  ~930K
>> 1        jaio      ~1220K
>> 6        mainline  ~2850K
>> 6        kaio      ~3050K
>> 6        kaio-dio  ~3080K
>> 6        jaio      ~3500K
>
> Cool, thanks for the numbers!
>
> I suspect the difference is due to contention on the ringbuffer on the
> completion side. You didn't enable my batched completion stuff, did you?

No, haven't tried the batching yet.

> I suspect the numbers would look quite a bit different with that, based
> on my own profiling. If the driver for the device you're testing on is
> open source, I'd be happy to do the conversion (it's a 5 minute job).

Knock yourself out - I already took a quick look at it, and the
conversion should be pretty simple. It's the mtip32xx driver; it's in
the kernel. I would suggest getting rid of ->async_callback() (since
it's always bio_endio()), as that'll make it cleaner.
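To make the suggestion concrete: the driver completes bios through an
->async_callback() function pointer that in practice always ends up in
bio_endio(). A minimal sketch of the simplification, using illustrative
names rather than the actual mtip32xx structures, and the
bio_endio(bio, error) signature of this era's kernels:

    #include <linux/bio.h>

    /* Before (illustrative, not the real mtip32xx code): completion
     * bounces through a function pointer that is always bio_endio().
     */
    struct example_cmd {
            struct bio *bio;
            void (*async_callback)(void *data, int status);
            void *async_data;
    };

    static void example_complete_indirect(struct example_cmd *cmd, int status)
    {
            cmd->async_callback(cmd->async_data, status);
    }

    /* After: complete the bio directly.  Besides being cleaner, this
     * leaves an obvious single spot to hook a batched-completion API
     * into later.
     */
    static void example_complete_direct(struct example_cmd *cmd, int status)
    {
            bio_endio(cmd->bio, status ? -EIO : 0);
    }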
> Also, I don't think our approaches really conflict - it's been a while
> since I looked at your patch, but you're getting rid of the aio
> ringbuffer and using a linked list instead, right? My batched
> completion stuff should still benefit that case.

Completely agree. I split my patches up a bit yesterday, and then I took
a look at your series. There's a bit of overlap between the two, but
really most of it would be useful together. You can see the (bit more)
split series here:

http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-dio

And yes, I make the ring interface optional. Basically you tell aio
whether or not to use the ring at io_queue_init() time. If you don't
care about the ring, we can use a lockless list for the completions.
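A rough sketch of what "a lockless list for the completions" can look
like on the kernel side, using <linux/llist.h> - hypothetical structure
and function names, not the actual patch:

    #include <linux/llist.h>

    /* Hypothetical per-request and per-context state, not the real
     * kiocb/kioctx layout.
     */
    struct example_req {
            struct llist_node ll;   /* completion list linkage */
            long res;               /* completion result */
    };

    struct example_ctx {
            struct llist_head completed;    /* lock-free completion list */
    };

    /* Completion (IRQ) side: no locks, just an atomic push. */
    static void example_complete(struct example_ctx *ctx,
                                 struct example_req *req)
    {
            llist_add(&req->ll, &ctx->completed);
    }

    /* Reaping side (the io_getevents() path): grab the whole list in
     * one shot.  Note llist_del_all() hands back entries in LIFO
     * order, so a real implementation would reverse the list if event
     * ordering matters.
     */
    static void example_reap(struct example_ctx *ctx)
    {
            struct llist_node *node = llist_del_all(&ctx->completed);

            while (node) {
                    struct example_req *req =
                            llist_entry(node, struct example_req, ll);
                    node = node->next;
                    /* ... copy req->res out to the user's io_event ... */
            }
    }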
You completely remove cancel support, while I just make it optional for
the gadget case. I'm fine with either approach, though I did not look at
your USB change in detail. If it's clean, I suspect we should just kill
cancel completely, as you did.

> Though - hrm, I'd have expected getting rid of the cancellation linked
> list to make a bigger difference, and both our patchsets do that.

The machine in question runs out of oomph, which is hampering the
results. I should have it beefed up next week. It's running an E5-2630
right now and will move to an E5-2690; I think that should make the
results clearer.

> What device are you testing on, and what's your fio script? I may just
> have to buy some hardware so I can test this myself.

Pretty basic script, it's attached. We could probably eke more out of
the system, but it's been fine for a basic apples-to-apples comparison.
I'm using 6x p320h for this test case.

--
Jens Axboe

[global]
bs=4k
direct=1
ioengine=libaio
iodepth=42
numjobs=5
rwmixread=100
rw=randrw
iodepth_batch=8
iodepth_batch_submit=4
iodepth_batch_complete=4
random_generator=lfsr
group_reporting=1

[rssda]
cpus_allowed=0,2,4,6,8,10
filename=/dev/rssda

[rssdb]
cpus_allowed=0,2,4,6,8,10
filename=/dev/rssdb

[rssdc]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssdc

[rssdd]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssdd

[rssde]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssde

[rssdf]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssdf
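A job file like the above is run simply as "fio <jobfile>". Under the
hood, fio's ioengine=libaio drives the io_submit()/io_getevents() path
this whole thread is about. For readers who haven't used it, here is a
bare-bones standalone sketch of that loop against one of the devices
above - assuming stock libaio (no ring-opt-out flag from the patches
under discussion), and not what fio actually does internally:

    /* Minimal libaio read loop; build with -laio. */
    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbs[1] = { &cb };
            struct io_event ev;
            void *buf;
            int fd;

            fd = open("/dev/rssda", O_RDONLY | O_DIRECT); /* device from the jobfile */
            if (fd < 0)
                    return 1;
            if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT needs aligned buffers */
                    return 1;
            if (io_queue_init(42, &ctx))            /* iodepth=42, as in the jobfile */
                    return 1;

            io_prep_pread(&cb, fd, buf, 4096, 0);   /* one 4k read at offset 0 */
            if (io_submit(ctx, 1, cbs) != 1)        /* fio batches these (iodepth_batch_submit) */
                    return 1;
            if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)  /* reap: the contended path */
                    return 1;

            printf("res=%ld\n", (long)ev.res);      /* 4096 on success */
            io_queue_release(ctx);
            return 0;
    }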