Re: [RFC] IO scheduler based IO controller V7

Gui Jianfeng <guijianfeng@xxxxxxxxxxxxxx> · Mon, 03 Aug 2009 08:40:45 +0800

Vivek Goyal wrote:
> On Fri, Jul 31, 2009 at 01:21:51PM +0800, Gui Jianfeng wrote:
>> Hi Vivek,
>>
>> Here are some test results for normal reads and writes for IO Controller V7, measured with fio.
>> Tested with "fairness == 0". Performance seems to have improved compared with V6.
>>
>> Mode         Normal read   |   Random read   |   Normal write   |   Random write  |  Direct read  |  Direct Write
>>
>> 2.6.31-rc1   71,613KiB/s       3,606KiB/s        66,250KiB/s        9,420KiB/s       51,535KiB/s     55,752KiB/s
>>
>> V7           70,540KiB/s       3,551KiB/s        64,548KiB/s        9,677KiB/s       53,530KiB/s     54,145KiB/s
>>
>> Performance  -1.5%             -1.5%             -2.6%              +2.7%            +3.9%           -2.9%
>>
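The "Performance" row can be rechecked from the two throughput rows. The following is only a small sketch for doing that; pct_change is a hypothetical helper name, not something fio or the patchset provides, and the figures are copied from the table with the thousands separators removed.

```shell
# pct_change BASE NEW: print the signed percentage change from BASE to NEW,
# rounded to one decimal place. Both arguments are throughput in KiB/s.
pct_change() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%+.1f%%\n", (new - base) / base * 100 }'
}

# 2.6.31-rc1 (baseline) versus V7, values taken from the table above.
pct_change 71613 70540   # normal read
pct_change 3606 3551     # random read
pct_change 66250 64548   # normal write
pct_change 9420 9677     # random write
pct_change 51535 53530   # direct read
pct_change 55752 54145   # direct write
```

A negative result means V7 is slower than the baseline for that workload.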
> 
> Thanks Gui. Can you also try V7 with CONFIG_TRACK_ASYNC_CONTEXT=n. I tried
> that and I got better results for buffered writes.

  Yes, I'm also going to try it.

> 
> In my testing I still see some performance regression for buffered writes
> which goes away if I disable group io scheduling and just use flat mode.
> 
> I will spend more time to find out where it is coming from.
> 
> Thanks
> Vivek
> 
> 
>> Vivek Goyal wrote:
>>> Hi All,
>>>
>>> Here is the V7 of the IO controller patches generated on top of 2.6.31-rc4.
>>>
>>> For ease of patching, a consolidated patch is available here.
>>>
>>> http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v7.patch
>>>
>>> Previous versions of the patches were posted here.
>>>
>>> (V1) http://lkml.org/lkml/2009/3/11/486
>>> (V2) http://lkml.org/lkml/2009/5/5/275
>>> (V3) http://lkml.org/lkml/2009/5/26/472
>>> (V4) http://lkml.org/lkml/2009/6/8/580
>>> (V5) http://lkml.org/lkml/2009/6/19/279
>>> (V6) http://lkml.org/lkml/2009/7/2/369
>>>
>>> Changes from V6
>>> ===============
>>> - Introduced the notion of group_idling where we idle for next request to
>>>   come from the same group before we expire it. It is along the lines of
>>>   cfq's slice_idle thing to provide fairness. Switching to group idling
>>>   now helps in the sense that we don't have to rely on whether queue idling
>>>   was turned on or not by CFQ. That becomes too much of a debugging pain with
>>>   different workloads and different kinds of storage media. Introduction
>>>   of group_idle should help.
>>>
>>> - Moved some of the code like dynamic queue idling update, arming queue
>>>   idling timer, keeping track of average think time etc back to CFQ. With
>>>   group idling we don't need it now. Reduce the amount of change.
>>>
>>> - Enabled cfq's close cooperator functionality in groups. So far this worked
>>>   only in root group. Now it should work in non-root groups also.
>>>
>>> - Got rid of the patch where we calculated disk time based on average disk
>>>   rate in some circumstances. It was giving bad numbers in early queue
>>>   deletion cases. Also did not think that it was helping a lot. Removed it
>>>   for the time being.
>>>  
>>> - Added an experimental patch to map sync requests using bio tracking info and
>>>   not task context. This is only for noop, deadline and AS.
>>>
>>> - Got rid of experimental patch of idling for async queues. Don't think it
>>>   was helping.
>>>
>>> - Got rid of wait_busy and wait_busy_done logic from queue. Instead
>>>   implemented it for groups.
>>>
>>> - Introduced oom_ioq to accommodate the recent oom_cfqq change.
>>>
>>> - Broke-up elv_init_ioq() fn into smaller functions. It had 7 arguments and
>>>   looked complicated.
>>>
>>> - Fixed a bug in blk_queue_io_group_congested(). Thanks to Munehiro Ikeda.
>>>
>>> - Merged gui's patch to fix the cgroup file format issue.
>>>
>>> - Merged gui's patch to update per group congestion limit when
>>>   q->nr_group_requests is updated.
>>>
>>> - Fixed a bug where close cooperation will not work if we wait for all the
>>>   requests to finish from previous queue.
>>>
>>> - Fixed group deletion accounting where deletions from the idle tree were also
>>>   appearing in the log.
>>>
>>> - Got rid of busy_rt_queues infrastructure.
>>>
>>> - Got rid of elv_ioq_request_dispatched(). It was a helper function just to
>>>   increment a variable.
>>>   
>>> Limitations
>>> ===========
>>>
>>> - This IO controller provides the bandwidth control at the IO scheduler
>>>   level (leaf node in stacked hierarchy of logical devices). So there can
>>>   be cases (depending on configuration) where an application does not see
>>>   proportional BW division at a higher level logical device.
>>>
>>>   LWN has written an article about the issue here.
>>>
>>> 	http://lwn.net/Articles/332839/
>>>
>>> How to solve the issue of fairness at higher level logical devices
>>> ==================================================================
>>> (Do we really need it? That's not where the contention for resources is.)
>>>
>>> A couple of suggestions have come forward.
>>>
>>> - Implement IO control at IO scheduler layer and then with the help of
>>>   some daemon, adjust the weight on underlying devices dynamically, depending
>>>   on what kind of BW guarantees are to be achieved at higher level logical
>>>   block devices.
>>>
>>> - Also implement a higher level IO controller along with IO scheduler
>>>   based controller and let user choose one depending on his needs.
>>>
>>>   A higher level controller does not know about the assumptions/policies
>>>   of the underlying IO scheduler, hence it has the potential to break down
>>>   the IO scheduler's policy within a cgroup. A lower level controller
>>>   can work with the IO scheduler much more closely and efficiently.
>>>  
>>> Other active IO controller developments
>>> =======================================
>>>
>>> IO throttling
>>> -------------
>>>
>>>   This is a max bandwidth controller and not a proportional one. Secondly
>>>   it is a second level controller which can break the IO scheduler's
>>>   policy/assumptions within a cgroup.
>>>
>>> dm-ioband
>>> ---------
>>>
>>>  This is a proportional bandwidth controller implemented as a device mapper
>>>  driver. It is also a second level controller which can break the
>>>  IO scheduler's policy/assumptions within a cgroup.
>>>
>>> TODO
>>> ====
>>> - code cleanups, testing, bug fixing, optimizations, benchmarking etc...
>>>
>>> Testing
>>> =======
>>>
>>> I have been able to do some testing as follows. All my testing is with ext3
>>> file system with a SATA drive which supports queue depth of 31.
>>>
>>> Test1 (Isolation between two KVM virtual machines)
>>> ==================================================
>>> Created two KVM virtual machines. Partitioned a disk on the host into two
>>> partitions and gave one partition to each virtual machine. Put the two virtual
>>> machines in two different cgroups with weights 1000 and 500 respectively. The
>>> virtual machines created ext3 file systems on the partitions exported from the
>>> host and did buffered writes. The host sees these writes as synchronous, and the
>>> virtual machine with the higher weight gets double the disk time of the one
>>> with the lower weight. Used the deadline scheduler in this test case.
>>>
>>> Some more details about configuration are in documentation patch.
>>>
>>> Test2 (Fairness for synchronous reads)
>>> ======================================
>>> - Two dd in two cgroups with cgroup weights 1000 and 500. Ran two "dd" in those
>>>   cgroups (With CFQ scheduler and /sys/block/<device>/queue/fairness = 1)
>>>
>>>   Higher weight dd finishes first and at that point of time my script takes
>>>   care of reading cgroup files io.disk_time and io.disk_sectors for both the
>>>   groups and displays the results.
>>>
>>>   dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
>>>   dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &
>>>
>>>   234179072 bytes (234 MB) copied, 3.9065 s, 59.9 MB/s
>>>   234179072 bytes (234 MB) copied, 5.19232 s, 45.1 MB/s
>>>
>>>   group1 time=8 16 2471 group1 sectors=8 16 457840
>>>   group2 time=8 16 1220 group2 sectors=8 16 225736
>>>
>>> First two fields in time and sectors statistics represent major and minor
>>> number of the device. Third field represents disk time in milliseconds and
>>> number of sectors transferred respectively.
>>>
>>> This patchset tries to provide fairness in terms of disk time received. group1
>>> got almost double of group2 disk time (at the time the first dd finished). These
>>> time and sectors statistics can be read using the io.disk_time and
>>> io.disk_sectors files in the cgroup. More about it in the documentation file.
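To put a number on "almost double", the two statistics lines can be reduced to ratios and compared with the weight ratio (1000:500, so a target of 2). This is only an illustrative sketch; stat_ratios is a hypothetical helper, and it assumes the line layout printed by the test script above (name, time=major minor msec, name, sectors=major minor count).

```shell
# stat_ratios: read two statistics lines on stdin (higher-weight group first)
# and print the ratio of their disk times and of their sector counts.
# Assumed layout per line: NAME time=MAJ MIN MSEC NAME sectors=MAJ MIN COUNT
stat_ratios() {
    awk '
        { msec[NR] = $4; sect[NR] = $8 }
        END {
            # Refuse to guess on malformed input or a zero denominator.
            if (NR != 2 || msec[2] == 0 || sect[2] == 0) exit 1
            printf "time=%.2f sectors=%.2f\n", msec[1] / msec[2], sect[1] / sect[2]
        }'
}

# Test2 figures from above.
printf '%s\n' \
    'group1 time=8 16 2471 group1 sectors=8 16 457840' \
    'group2 time=8 16 1220 group2 sectors=8 16 225736' | stat_ratios
```

Both ratios come out slightly above 2 for these figures, which is in line with the configured weights.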
>>>
>>> Test3 (Reader Vs Buffered Writes)
>>> ================================
>>> Buffered writes can be problematic and can overwhelm readers, especially with
>>> noop and deadline. IO controller can provide isolation between readers and
>>> buffered (async) writers.
>>>
>>> First I ran the test without io controller to see the severity of the issue.
>>> Ran a hostile writer and then after 10 seconds started a reader and then
>>> monitored the completion time of reader. Reader reads a 256 MB file. Tested
>>> this with noop scheduler.
>>>
>>> sample script
>>> ------------
>>> sync
>>> echo 3 > /proc/sys/vm/drop_caches
>>> time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 \
>>>     conv=fdatasync &
>>> sleep 10
>>> time dd if=/mnt/sdb/256M-file of=/dev/null &
>>>
>>> Results
>>> -------
>>> 8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
>>> 268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)
>>>
>>> Now it was time to test whether the io controller can provide isolation between
>>> readers and writers with noop. I created two cgroups of weight 1000 each and
>>> put the reader in group1 and the writer in group2 and ran the test again. Upon
>>> completion of the reader, my scripts read the io.disk_time and io.disk_sectors
>>> cgroup files to get an estimate of how much disk time each group got and how
>>> many sectors each group did IO for.
>>>
>>> For more accurate accounting of disk time for buffered writes with queuing
>>> hardware I had to set /sys/block/<disk>/queue/iosched/fairness to "1".
>>>
>>> sample script
>>> -------------
>>> echo $$ > /cgroup/bfqio/test2/tasks
>>> dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
>>> sleep 10
>>> echo noop > /sys/block/$BLOCKDEV/queue/scheduler
>>> echo  1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
>>> echo $$ > /cgroup/bfqio/test1/tasks
>>> dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
>>> wait $!
>>> # Some code for reading cgroup files upon completion of reader.
>>> -------------------------
>>>
>>> Results
>>> =======
>>> 268435456 bytes (268 MB) copied, 6.65819 s, 40.3 MB/s (Reader) 
>>>
>>> group1 time=8 16 3063	group1 sectors=8 16 524808
>>> group2 time=8 16 3071	group2 sectors=8 16 441752
>>>
>>> Note that the reader now finishes in much less time, and both group1 and group2
>>> got almost 3 seconds of disk time. Hence the io-controller provides isolation
>>> from buffered writes.
>>>
>>> Test4 (AIO)
>>> ===========
>>>
>>> AIO reads
>>> -----------
>>> Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
>>> respectively. I am using cfq scheduler. Following are some lines from my test
>>> script.
>>>
>>> ---------------------------------------------------------------
>>> echo 1000 > /cgroup/bfqio/test1/io.weight
>>> echo 500 > /cgroup/bfqio/test2/io.weight
>>>
>>> fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
>>> echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
>>>
>>> echo $$ > /cgroup/bfqio/test1/tasks
>>> fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
>>>     --output=/mnt/$BLOCKDEV/fio1/test1.log \
>>>     --exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &
>>>
>>> echo $$ > /cgroup/bfqio/test2/tasks
>>> fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
>>>     --output=/mnt/$BLOCKDEV/fio2/test2.log &
>>> ----------------------------------------------------------------
>>>
>>> test1 and test2 are two groups with weight 1000 and 500 respectively.
>>> "read-and-display-group-stats.sh" is one small script which reads the
>>> test1 and test2 cgroup files to determine how much disk time each group
>>> got till first fio job finished.
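The helper script itself is not included in the posting. The sketch below is only a guess at what such a script might do, not the actual read-and-display-group-stats.sh: show_group_stats is a hypothetical function, and it assumes that io.disk_time and io.disk_sectors each hold one "major minor value" line per device, as the printed statistics suggest.

```shell
# show_group_stats CGROUP_ROOT MAJOR MINOR GROUP...
# For each group, print the io.disk_time and io.disk_sectors entries that
# belong to the device MAJOR:MINOR.
show_group_stats() {
    root=$1
    major=$2
    minor=$3
    shift 3
    for group in "$@"; do
        # Keep only the line whose first two fields match the device numbers.
        disk_time=$(awk -v ma="$major" -v mi="$minor" \
            '$1 == ma && $2 == mi' "$root/$group/io.disk_time")
        disk_sectors=$(awk -v ma="$major" -v mi="$minor" \
            '$1 == ma && $2 == mi' "$root/$group/io.disk_sectors")
        echo "$group statistics: time=$disk_time   sectors=$disk_sectors"
    done
}
```

With the cgroup hierarchy mounted as in the test script, a call could look like: show_group_stats /cgroup/bfqio "$maj_dev" "$minor_dev" test1 test2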
>>>
>>> Results
>>> ------
>>> test1 statistics: time=8 16 22403   sectors=8 16 1049640
>>> test2 statistics: time=8 16 11400   sectors=8 16 552864
>>>
>>> The above shows that by the time the first fio job (higher weight) finished,
>>> group test1 got 22403 ms of disk time and group test2 got 11400 ms of disk
>>> time. The statistics for the number of sectors transferred are shown as well.
>>>
>>> Note that the disk time given to group test1 is almost double that of group
>>> test2.
>>>
>>> AIO writes
>>> ----------
>>> Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
>>> respectively. I am using cfq scheduler. Following are some lines from my test
>>> script.
>>>
>>> ------------------------------------------------
>>> echo 1000 > /cgroup/bfqio/test1/io.weight
>>> echo 500 > /cgroup/bfqio/test2/io.weight
>>> fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"
>>>
>>> echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
>>>
>>> echo $$ > /cgroup/bfqio/test1/tasks
>>> fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
>>>     --output=/mnt/$BLOCKDEV/fio1/test1.log \
>>>     --exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &
>>>
>>> echo $$ > /cgroup/bfqio/test2/tasks
>>> fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
>>>     --output=/mnt/$BLOCKDEV/fio2/test2.log &
>>> -------------------------------------------------
>>>
>>> test1 and test2 are two groups with weight 1000 and 500 respectively.
>>> "read-and-display-group-stats.sh" is one small script which reads the
>>> test1 and test2 cgroup files to determine how much disk time each group
>>> got till first fio job finished.
>>>
>>> Following are the results.
>>>
>>> test1 statistics: time=8 16 29085   sectors=8 16 1049656
>>> test2 statistics: time=8 16 14652   sectors=8 16 516728
>>>
>>> The above shows that by the time the first fio job (higher weight) finished,
>>> group test1 got 29085 ms of disk time and group test2 got 14652 ms of disk
>>> time. The statistics for the number of sectors transferred are shown as well.
>>>
>>> Note that the disk time given to group test1 is almost double that of group
>>> test2.
>>>
>>> Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
>>> ===================================================================
>>> Fairness for async writes is tricky, and the biggest reason is that async writes
>>> are cached in higher layers (page cache) and possibly in the file system
>>> layer as well (btrfs, xfs etc), and are dispatched to lower layers not
>>> necessarily in a proportional manner.
>>>
>>> For example, consider two dd threads reading /dev/zero as input file and doing
>>> writes of huge files. Very soon we will cross vm_dirty_ratio and a dd thread will
>>> be forced to write out some pages to disk before more pages can be dirtied. But
>>> the dirty pages picked are not necessarily those of the same thread. Writeout can
>>> very well pick the inode of the lower weight dd thread. So effectively the
>>> higher weight dd is doing writeouts of the lower weight dd's pages and we don't
>>> see service differentiation.
>>>
>>> IOW, the core problem with async write fairness is that the higher weight thread
>>> does not throw enough IO traffic at the IO controller to keep its queue
>>> continuously backlogged. In my testing, there are many .2 to .8 second
>>> intervals where the higher weight queue is empty, and in that duration the lower
>>> weight queue gets lots of work done, giving the impression that there was no
>>> service differentiation.
>>>
>>> In summary, from the IO controller's point of view, async write support is there.
>>> Because the page cache has not been designed in such a manner that a higher
>>> prio/weight writer can do more writeout than a lower prio/weight writer,
>>> getting service differentiation is hard; it is visible in some cases and not
>>> in others.
>>>
>>> Do we really care that much for fairness among two writer cgroups? One can
>>> choose to do direct writes or sync writes if fairness for writes really
>>> matters for him.
>>>
>>> Following is the only case where it is hard to ensure fairness between cgroups.
>>>
>>> - Buffered writes Vs Buffered Writes.
>>>
>>> So to test async writes I created two partitions on a disk and created ext3
>>> file systems on both the partitions.  Also created two cgroups and generated
>>> lots of write traffic in two cgroups (50 fio threads) and watched the disk
>>> time statistics in respective cgroups at the interval of 2 seconds. Thanks to
>>> ryo tsuruta for the test case.
>>>
>>> *****************************************************************
>>> sync
>>> echo 3 > /proc/sys/vm/drop_caches
>>>
>>> fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"
>>>
>>> echo $$ > /cgroup/bfqio/test1/tasks
>>> fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &
>>>
>>> echo $$ > /cgroup/bfqio/test2/tasks
>>> fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
>>> *********************************************************************** 
>>>
>>> And watched the disk time and sector statistics for both the cgroups
>>> every 2 seconds using a script. Here is a snippet from the output.
>>>
>>> test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
>>> test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2
>>>
>>> test1 statistics: time=8 48 5586   sectors=8 48 339064 dq=8 48 2
>>> test2 statistics: time=8 48 2985   sectors=8 48 146656 dq=8 48 3
>>>
>>> test1 statistics: time=8 48 9935   sectors=8 48 628728 dq=8 48 3
>>> test2 statistics: time=8 48 5265   sectors=8 48 278688 dq=8 48 4
>>>
>>> test1 statistics: time=8 48 14156   sectors=8 48 932488 dq=8 48 6
>>> test2 statistics: time=8 48 7646   sectors=8 48 412704 dq=8 48 7
>>>
>>> test1 statistics: time=8 48 18141   sectors=8 48 1231488 dq=8 48 10
>>> test2 statistics: time=8 48 9820   sectors=8 48 548400 dq=8 48 8
>>>
>>> test1 statistics: time=8 48 21953   sectors=8 48 1485632 dq=8 48 13
>>> test2 statistics: time=8 48 12394   sectors=8 48 698288 dq=8 48 10
>>>
>>> test1 statistics: time=8 48 25167   sectors=8 48 1705264 dq=8 48 13
>>> test2 statistics: time=8 48 14042   sectors=8 48 817808 dq=8 48 10
>>>
>>> First two fields in time and sectors statistics represent major and minor
>>> number of the device. Third field represents disk time in milliseconds and
>>> number of sectors transferred respectively.
>>>
>>> So disk time consumed by group1 is almost double of group2 in this case.
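The figures above are cumulative, so the ratio between consecutive samples can differ from the overall ratio. One way to look at that, shown here only as a sketch and not as part of the original test script, is to take the difference between successive samples for each group. interval_deltas is a hypothetical helper that assumes the "testN statistics: time=major minor msec ..." layout above, with the test1 line preceding the test2 line in every sample.

```shell
# interval_deltas: read cumulative statistics lines on stdin and print, for each
# pair of consecutive samples, the disk time gained by test1, the disk time
# gained by test2, and the ratio of the two.
interval_deltas() {
    awk '
        $1 == "test1" { t1 = $5 }
        $1 == "test2" {
            t2 = $5
            # Skip the first sample and any interval where test2 got no time.
            if (seen && t2 > p2)
                printf "%d %d %.2f\n", t1 - p1, t2 - p2, (t1 - p1) / (t2 - p2)
            p1 = t1; p2 = t2; seen = 1
        }'
}

# First three samples from the output above.
interval_deltas <<'EOF'
test1 statistics: time=8 48 1315   sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633   sectors=8 48 14720 dq=8 48 2

test1 statistics: time=8 48 5586   sectors=8 48 339064 dq=8 48 2
test2 statistics: time=8 48 2985   sectors=8 48 146656 dq=8 48 3

test1 statistics: time=8 48 9935   sectors=8 48 628728 dq=8 48 3
test2 statistics: time=8 48 5265   sectors=8 48 278688 dq=8 48 4
EOF
```

Each output line covers one interval between samples; a ratio near 2 means the 1000:500 weights were honoured during that interval.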
>>>
>>> Your feedback is welcome.
>>>
>>> Thanks
>>> Vivek
>>>
>>>
>>>
>> -- 
>> Regards
>> Gui Jianfeng
> 
> 
> 

-- 
Regards
Gui Jianfeng

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel