On 2012-12-15 10:25, Kent Overstreet wrote:
> On Fri, Dec 14, 2012 at 08:35:53AM +0100, Jens Axboe wrote:
>> On 2012-12-14 03:26, Jack Wang wrote:
>>> 2012/12/14 Jens Axboe <jaxboe@xxxxxxxxxxxx>:
>>>> On Mon, Dec 03 2012, Kent Overstreet wrote:
>>>>> Last posting: http://thread.gmane.org/gmane.linux.kernel.aio.general/3169
>>>>>
>>>>> Changes since the last posting should all be noted in the individual
>>>>> patch descriptions.
>>>>>
>>>>> * Zach pointed out the aio_read_evt() patch was calling functions that
>>>>>   could sleep in TASK_INTERRUPTIBLE state; that patch is rewritten.
>>>>> * Ben pointed out some synchronize_rcu() usage was problematic;
>>>>>   converted it to call_rcu().
>>>>> * The flush_dcache_page() patch is new.
>>>>> * Changed the "use cancellation list lazily" patch so as to remove
>>>>>   ki_flags from struct kiocb.
>>>>
>>>> Kent, I ran a few tests, and the below patches still don't seem as fast
>>>> as the approach I took. To keep it fair, I used your aio branch and
>>>> applied my dio speedups too. As a sanity check, I ran with your branch
>>>> alone as well. The quick results are below - kaio is kent-aio, just
>>>> your branch; kaio-dio is with the direct IO speedups too; jaio is my
>>>> branch, which already has the dio changes.
>>>>
>>>> Devices  Branch    IOPS
>>>> 1        kaio      ~915K
>>>> 1        kaio-dio  ~930K
>>>> 1        jaio      ~1220K
>>>> 6        kaio      ~3050K
>>>> 6        kaio-dio  ~3080K
>>>> 6        jaio      ~3500K
>>>>
>>>> The box runs out of CPU driving power, which is why it doesn't scale
>>>> linearly; otherwise I know that jaio at least does. It's basically
>>>> completion-limited for the 6-device test at the moment.
>>>>
>>>> I'll run some profiling tomorrow morning and get you some better
>>>> results. Just thought I'd share these at least.
>>>>
>>>> --
>>>> Jens Axboe
>>>
>>> Really good performance, wow.
>>>
>>> I think the device tested is a really fast PCIe SSD built by Fusion-io,
>>> with Fusion-io's in-house block driver?
>>
>> It is PCIe flash storage, but it is not Fusion-io.
>>
>>> Any comparison numbers against current mainline?
>>
>> Sure, I should have included that. Here's the table again, this time
>> with mainline as well.
>>
>> Devices  Branch    IOPS
>> 1        mainline  ~870K
>> 1        kaio      ~915K
>> 1        kaio-dio  ~930K
>> 1        jaio      ~1220K
>> 6        mainline  ~2850K
>> 6        kaio      ~3050K
>> 6        kaio-dio  ~3080K
>> 6        jaio      ~3500K
>
> Cool, thanks for the numbers!
>
> I suspect the difference is due to contention on the ringbuffer on the
> completion side. You didn't enable my batched completion stuff, did you?

No, haven't tried the batching yet.

> I suspect the numbers would look quite a bit different with that, based
> on my own profiling. If the driver for the device you're testing on is
> open source, I'd be happy to do the conversion (it's a 5 minute job).

Knock yourself out - I already took a quick look at it, and the
conversion should be pretty simple. It's the mtip32xx driver; it's in
the kernel. I would suggest getting rid of ->async_callback() (since
it's always bio_endio()), as that'll make it cleaner.
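To make the suggestion concrete: the driver completes bios through an
->async_callback() function pointer that in practice always ends up in
bio_endio(). A minimal sketch of the simplification, using illustrative
names rather than the actual mtip32xx structures, and the
bio_endio(bio, error) signature of this era's kernels:

    #include <linux/bio.h>

    /* Before (illustrative, not the real mtip32xx code): completion
     * bounces through a function pointer that is always bio_endio().
     */
    struct example_cmd {
            struct bio *bio;
            void (*async_callback)(void *data, int status);
            void *async_data;
    };

    static void example_complete_indirect(struct example_cmd *cmd, int status)
    {
            cmd->async_callback(cmd->async_data, status);
    }

    /* After: complete the bio directly.  Besides being cleaner, this
     * leaves an obvious single spot to hook a batched-completion API
     * into later.
     */
    static void example_complete_direct(struct example_cmd *cmd, int status)
    {
            bio_endio(cmd->bio, status ? -EIO : 0);
    }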
> Also, I don't think our approaches really conflict - it's been a while
> since I looked at your patch, but you're getting rid of the aio
> ringbuffer and using a linked list instead, right? My batched
> completion stuff should still benefit that case.

Completely agree. I split my patches up a bit yesterday, and then I took
a look at your series. There's a bit of overlap between the two, but
really most of it would be useful together. You can see the (bit more)
split series here:

http://git.kernel.dk/?p=linux-block.git;a=shortlog;h=refs/heads/aio-dio

And yes, I make the ring interface optional. Basically you tell aio
whether or not to use the ring at io_queue_init() time. If you don't
care about the ring, we can use a lockless list for the completions.
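A rough sketch of what "a lockless list for the completions" can look
like on the kernel side, using <linux/llist.h> - hypothetical structure
and function names, not the actual patch:

    #include <linux/llist.h>

    /* Hypothetical per-request and per-context state, not the real
     * kiocb/kioctx layout.
     */
    struct example_req {
            struct llist_node ll;   /* completion list linkage */
            long res;               /* completion result */
    };

    struct example_ctx {
            struct llist_head completed;    /* lock-free completion list */
    };

    /* Completion (IRQ) side: no locks, just an atomic push. */
    static void example_complete(struct example_ctx *ctx,
                                 struct example_req *req)
    {
            llist_add(&req->ll, &ctx->completed);
    }

    /* Reaping side (the io_getevents() path): grab the whole list in
     * one shot.  Note llist_del_all() hands back entries in LIFO
     * order, so a real implementation would reverse the list if event
     * ordering matters.
     */
    static void example_reap(struct example_ctx *ctx)
    {
            struct llist_node *node = llist_del_all(&ctx->completed);

            while (node) {
                    struct example_req *req =
                            llist_entry(node, struct example_req, ll);
                    node = node->next;
                    /* ... copy req->res out to the user's io_event ... */
            }
    }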
You completely remove cancel support, while I just make it optional for
the gadget case. I'm fine with either approach, though I did not look at
your USB change in detail. If it's clean, I suspect we should just kill
cancel completely, as you did.

> Though - hrm, I'd have expected getting rid of the cancellation linked
> list to make a bigger difference, and both our patchsets do that.

The machine in question runs out of oomph, which is hampering the
results. I should have it beefed up next week. It's running an E5-2630
right now and will move to an E5-2690; I think that should make the
results clearer.

> What device are you testing on, and what's your fio script? I may just
> have to buy some hardware so I can test this myself.

Pretty basic script, it's attached. We could probably eke more out of
the system, but it's been fine for a basic apples-to-apples comparison.
I'm using 6x p320h for this test case.

--
Jens Axboe

[global]
bs=4k
direct=1
ioengine=libaio
iodepth=42
numjobs=5
rwmixread=100
rw=randrw
iodepth_batch=8
iodepth_batch_submit=4
iodepth_batch_complete=4
random_generator=lfsr
group_reporting=1

[rssda]
cpus_allowed=0,2,4,6,8,10
filename=/dev/rssda

[rssdb]
cpus_allowed=0,2,4,6,8,10
filename=/dev/rssdb

[rssdc]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssdc

[rssdd]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssdd

[rssde]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssde

[rssdf]
cpus_allowed=1,3,5,7,9,11
filename=/dev/rssdf
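A job file like the above is run simply as "fio <jobfile>". Under the
hood, fio's ioengine=libaio drives the io_submit()/io_getevents() path
this whole thread is about. For readers who haven't used it, here is a
bare-bones standalone sketch of that loop against one of the devices
above - assuming stock libaio (no ring-opt-out flag from the patches
under discussion), and not what fio actually does internally:

    /* Minimal libaio read loop; build with -laio. */
    #define _GNU_SOURCE     /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
            io_context_t ctx = 0;
            struct iocb cb, *cbs[1] = { &cb };
            struct io_event ev;
            void *buf;
            int fd;

            fd = open("/dev/rssda", O_RDONLY | O_DIRECT); /* device from the jobfile */
            if (fd < 0)
                    return 1;
            if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT needs aligned buffers */
                    return 1;
            if (io_queue_init(42, &ctx))            /* iodepth=42, as in the jobfile */
                    return 1;

            io_prep_pread(&cb, fd, buf, 4096, 0);   /* one 4k read at offset 0 */
            if (io_submit(ctx, 1, cbs) != 1)        /* fio batches these (iodepth_batch_submit) */
                    return 1;
            if (io_getevents(ctx, 1, 1, &ev, NULL) != 1)  /* reap: the contended path */
                    return 1;

            printf("res=%ld\n", (long)ev.res);      /* 4096 on success */
            io_queue_release(ctx);
            return 0;
    }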