I got some clues about what is going on while doing 4K sequential reads
using fio. If I set the ioengine in the fio script to "libaio", I see
do_io_submit() plug and then unplug the queue around each submission. That
means after every IO we expect to send the IO immediately to the next
layer; if there happen to be any pending IOs they do merge, but not
because of plugging. This is what happens in my test: every time an IO
comes from the application with the libaio engine, it is sent down to the
elevator/IO scheduler, because the queue was unplugged in do_io_submit().
The moment I reduce the queue depth of the block device, merging starts,
because of congestion at the SCSI mid layer.

If I use the mmap engine, I see merged IO coming to the device driver
because of plugging. I don't fully understand how it works, but I gave it
a try and found that the merging happens because of plugging (I confirmed
this using blktrace).

Is there any ioengine in <fio> (or any other parameter setting), other
than mmap, which can use the plugging mechanism of the block layer to
merge more IO? (A few reproduction sketches are appended below the quoted
message.)

~ Kashyap

> -----Original Message-----
> From: Desai, Kashyap
> Sent: Wednesday, April 30, 2014 12:33 PM
> To: 'axboe@xxxxxxxxx'
> Cc: linux-scsi@xxxxxxxxxxxxxxx
> Subject: How to get more sequential IO merged at elevator
>
> Jens,
>
> While working on an issue of low IOPS for sequential READ/WRITE, I found
> some interesting things that were causing the performance drop for
> sequential IO. I did some reverse engineering on the block layer code to
> see whether any sysfs parameter settings could help, but could not find
> anything useful to solve this issue.
>
> I have described the problem statement and the root cause of this issue
> in this mail thread.
>
> Problem statement - "Cannot achieve sequential read/write performance,
> because back merging is not happening frequently."
>
> Here is my understanding of how back merging is done in the elevator.
>
> The Linux block layer is responsible for merging/sorting IO with the
> help of the elevator hook in the OS plus the IO scheduler. The IO
> scheduler has no role in merging sequential IO; that is done in the
> elevator hook, so choosing any particular IO scheduler in Linux will not
> help (in other words, the behavior is unchanged irrespective of the IO
> scheduler). Any sequential IO will be merged in the elevator code path.
>
> 1. When an IO comes from the upper layer, it is queued at the
> elevator/IO scheduler level. It is also added to a hash lookup table,
> which is used for merging and other purposes.
> 2. The elevator code searches for any outstanding IO in the queue at the
> same level. If there is any chance to merge it, it performs a BACK
> MERGE.
> 3. If no merge is possible, the IO is queued to the next level (the IO
> scheduler).
> 4. In the IO completion path, the IO scheduler posts IO to the driver
> queue, if there is any outstanding IO. (There are many other conditions,
> but this is the most common code path.)
>
> To merge more commands, step #2 needs to find more outstanding IO in the
> hash table lookup. This is possible if flow control kicks in at the
> block layer or driver level; that is, the driver/block layer forcefully
> delays IO submission to the next level, giving the elevator code more
> chance to merge IO by accumulating more IO from user space.
>
> If I manually lower the queue depth of a device which is doing only
> sequential IO (to a value between 1 and 8), I see maximally merged IO
> coming to the driver, and it eventually increases the IOPS.
>
> Is there any way to increase the likelihood of merged IO coming from the
> block layer to the low-level driver?
>
> Thanks,
> Kashyap
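
P.S. In case anyone wants to reproduce this, below is roughly the kind of
fio run I am describing. Treat /dev/sdb and the exact parameter values as
placeholders for your own test device; I left direct=1 off the mmap job
since that engine goes through the page cache:

  # 4K sequential read through libaio: with the default device queue
  # depth I see almost no back merges, since do_io_submit() unplugs the
  # queue on every submission.
  fio --name=seqread-libaio --filename=/dev/sdb --direct=1 \
      --ioengine=libaio --iodepth=32 --rw=read --bs=4k \
      --runtime=30 --time_based

  # Same workload through the mmap engine: here blktrace shows merged
  # IO reaching the driver, apparently thanks to plugging.
  fio --name=seqread-mmap --filename=/dev/sdb \
      --ioengine=mmap --rw=read --bs=4k \
      --runtime=30 --time_based

  # Watch the trace live while a job runs; back merges show up as "M"
  # in the action column.
  blktrace -d /dev/sdb -o - | blkparse -i -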
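
If you would rather count than eyeball, something like this tallies back
merges against newly allocated requests (it assumes blkparse's default
output format, where the action code is the sixth field):

  # Capture 10 seconds of trace data, then compare "M" (back merge)
  # events with "G" (get request) events; a high M:G ratio means the
  # elevator is merging well.
  blktrace -d /dev/sdb -w 10 -o trace
  blkparse -i trace | awk '$6 == "M" {m++} $6 == "G" {g++}
      END {print "back merges:", m, "new requests:", g}'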
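
And the queue-depth experiment from the quoted mail, for reference. The
sysfs path applies to SCSI devices, and 4 is just one value from the 1-8
range that worked for me:

  # Lower the device queue depth so IO backs up above the driver; with
  # more requests held at the elevator, back merges start happening.
  cat /sys/block/sdb/device/queue_depth      # check the default
  echo 4 > /sys/block/sdb/device/queue_depth

  # iostat's rrqm/s column (read requests merged per second) should
  # climb once the queue depth is lowered.
  iostat -x 1 /dev/sdb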