On Sun, Aug 16, 2009 at 03:30:22PM -0400, Vivek Goyal wrote:
> Hi All,
>
> Here is the V8 of the IO controller patches generated on top of 2.6.31-rc6.

Forgot to mention that for ease of patching, a consolidated patch is here:

http://people.redhat.com/~vgoyal/io-controller/io-scheduler-based-io-controller-v8.patch

Thanks
Vivek

Previous versions of the patches were posted here:

(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
(V6) http://lkml.org/lkml/2009/7/2/369
(V7) http://lkml.org/lkml/2009/7/24/253

Changes from V7
===============
- Replaced BFQ with a CFS+CFQ-like hierarchical scheduler.

  Moving to the time domain as the service parameter broke BFQ's assumptions
  about how long a queue runs (a queue can run for more than its budget), and
  that in turn had the potential to break the O(1) guarantees of BFQ.

  In addition, BFQ was relatively complex and it was not clear whether the
  benefits were proportionate in a time domain setup. Hence, for the time
  being, BFQ is replaced with a simpler scheduler to see how well it performs.

  This scheduler borrows ideas from CFS and CFQ. Time slices are allocated to
  queues based on their priority (like CFQ). These disk times are converted
  to virtual disk time, and we keep track of each queue's vdisktime and each
  service tree's min_vdisktime to determine who has consumed how much disk
  time and who should run next (like CFS).

- Fixed a few issues reported by Jerome Marchand.

Apart from this there are miscellaneous cleanups, like getting rid of
unnecessary comments, function renames, debug code reorganization, etc.

Limitations
===========

- This IO controller provides bandwidth control at the IO scheduler level
  (leaf node in a stacked hierarchy of logical devices). So there can be
  cases (depending on configuration) where an application does not see
  proportional BW division at a higher-level logical device.

  LWN has written an article about the issue here:

  http://lwn.net/Articles/332839/

How to solve the issue of fairness at higher level logical devices
===================================================================
(Do we really need it? That's not where the contention for resources is.)

A couple of suggestions have come forward.

- Implement IO control at the IO scheduler layer and then, with the help of
  some daemon, adjust the weights on the underlying devices dynamically,
  depending on what kind of BW guarantees are to be achieved at the
  higher-level logical block devices.

- Alternatively, implement a higher-level IO controller along with the IO
  scheduler based controller and let the user choose one depending on his
  needs.

A higher-level controller does not know about the assumptions/policies of the
underlying IO scheduler, hence it has the potential to break the IO
scheduler's policy within a cgroup. A lower-level controller can work with
the IO scheduler much more closely and efficiently.

Other active IO controller developments
=======================================

IO throttling
-------------

This is a max bandwidth controller, not a proportional one. Secondly, it is a
second-level controller which can break the IO scheduler's
policy/assumptions within a cgroup.

dm-ioband
---------

This is a proportional bandwidth controller implemented as a device mapper
driver. It is also a second-level controller which can break the IO
scheduler's policy/assumptions within a cgroup.

TODO
====
- code cleanups, testing, bug fixing, optimizations, benchmarking, etc.
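All the tests below share the same basic cgroup plumbing, so here is a rough
sketch of it in one place. It assumes the IO controller cgroup hierarchy is
already mounted at /cgroup/bfqio as in the test scripts further down; the
exact controller name, mount options and file names are described in the
documentation patch, so treat the group names here as illustrative.

-------------------------------------------------
# Create two cgroups with a 2:1 weight ratio (names are illustrative).
mkdir /cgroup/bfqio/test1 /cgroup/bfqio/test2
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500  > /cgroup/bfqio/test2/io.weight

# Several tests below set this for more accurate disk time accounting
# with queuing hardware.
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

# Move the current shell into a group; any IO issued by it (or by children
# forked afterwards) is accounted to that group.
echo $$ > /cgroup/bfqio/test1/tasks

# Per-group statistics can be read back from the cgroup files, e.g.:
cat /cgroup/bfqio/test1/io.disk_time
cat /cgroup/bfqio/test1/io.disk_sectors
-------------------------------------------------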
Testing
=======

I have been able to do some testing as follows. All my testing is with an
ext3 file system on a SATA drive which supports a queue depth of 31.

Test1 (Isolation between two KVM virtual machines)
==================================================
Created two KVM virtual machines. Partitioned a disk on the host into two
partitions and gave one partition to each virtual machine. Put the two
virtual machines in two different cgroups of weight 1000 and 500
respectively. The virtual machines created ext3 file systems on the
partitions exported from the host and did buffered writes. The host sees the
writes as synchronous, and the virtual machine with the higher weight gets
double the disk time of the virtual machine with the lower weight. Used the
deadline scheduler in this test case.

Some more details about the configuration are in the documentation patch.

Test2 (Fairness for synchronous reads)
======================================
- Two dd's in two cgroups with cgroup weights of 1000 and 500. Ran two "dd"
commands in those cgroups (with the CFQ scheduler and
/sys/block/<device>/queue/iosched/fairness = 1).

The dd with the higher weight finishes first, and at that point my script
reads the cgroup files io.disk_time and io.disk_sectors for both groups and
displays the results.

dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &

group1 time=8:16 2452 group1 sectors=8:16 457856
group2 time=8:16 1317 group2 sectors=8:16 247008

234179072 bytes (234 MB) copied, 3.90912 s, 59.9 MB/s
234179072 bytes (234 MB) copied, 5.15548 s, 45.4 MB/s

The first two fields in the time and sectors statistics represent the major
and minor number of the device. The third field represents the disk time in
milliseconds and the number of sectors transferred respectively.

This patchset tries to provide fairness in terms of disk time received.
group1 got almost double the disk time of group2 (at the time the first dd
finished). These time and sectors statistics can be read using the
io.disk_time and io.disk_sectors files in the cgroup. More about it in the
documentation file.

Test3 (Reader Vs Buffered Writes)
=================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. The IO controller can provide isolation between readers
and buffered (async) writers.

First I ran the test without the IO controller to see the severity of the
issue. Ran a hostile writer and then, after 10 seconds, started a reader and
monitored the completion time of the reader. The reader reads a 256 MB file.
Tested this with the noop scheduler.

sample script
-------------
sync
echo 3 > /proc/sys/vm/drop_caches
time dd if=/dev/zero of=/mnt/sdb/reader-writer-zerofile bs=4K count=2097152 conv=fdatasync &
sleep 10
time dd if=/mnt/sdb/256M-file of=/dev/null &

Results
-------
8589934592 bytes (8.6 GB) copied, 106.045 s, 81.0 MB/s (Writer)
268435456 bytes (268 MB) copied, 96.5237 s, 2.8 MB/s (Reader)

Now it was time to test whether the IO controller can provide isolation
between readers and writers with noop. I created two cgroups of weight 1000
each, put the reader in group1 and the writer in group2, and ran the test
again.
Upon completion of the reader, my scripts read the io.disk_time and
io.disk_sectors cgroup files to get an estimate of how much disk time each
group got and how many sectors of IO each group did.

For more accurate accounting of disk time for buffered writes with queuing
hardware, I had to set /sys/block/<disk>/queue/iosched/fairness to "1".

sample script
-------------
echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/$BLOCKDEV/testzerofile bs=4K count=2097152 &
sleep 10
echo noop > /sys/block/$BLOCKDEV/queue/scheduler
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/mnt/$BLOCKDEV/256M-file of=/dev/null &
wait $!
# Some code for reading cgroup files upon completion of reader.
-------------------------

Results
=======
268435456 bytes (268 MB) copied, 6.87668 s, 39.0 MB/s

group1 time=8:16 3719 group1 sectors=8:16 524816
group2 time=8:16 3659 group2 sectors=8:16 638712

Note that the reader now finishes in much less time, and group1 and group2
each got roughly the same disk time (about 3.7 seconds). Hence the IO
controller provides isolation from buffered writes.

Test4 (AIO)
===========

AIO reads
---------
Set up two fio AIO read jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

---------------------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight

fio_args="--ioengine=libaio --rw=read --size=512M --direct=1"
echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
--output=/mnt/$BLOCKDEV/fio1/test1.log \
--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
--output=/mnt/$BLOCKDEV/fio2/test2.log &
----------------------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the test1 and
test2 cgroup files to determine how much disk time each group got by the time
the first fio job finished.

Results
-------
test1 statistics: time=8:16 17686 sectors=8:16 1049664
test2 statistics: time=8:16 9036 sectors=8:16 585152

The above shows that by the time the first fio (higher weight) finished,
group test1 got 17686 ms of disk time and group test2 got 9036 ms of disk
time. Similarly, the statistics for the number of sectors transferred are
also shown.

Note that the disk time given to group test1 is almost double the disk time
of group test2.
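"read-and-display-group-stats.sh" itself is not included in this mail.
Roughly, assuming io.disk_time and io.disk_sectors contain one
"major:minor value" pair per device (as the outputs above suggest), it boils
down to something like the following, with the major and minor numbers passed
in from the --exec_postrun line above.

-------------------------------------------------
#!/bin/bash
# Hypothetical sketch of read-and-display-group-stats.sh (the actual script
# is not part of this posting).
# Usage: read-and-display-group-stats.sh <major> <minor>

maj=$1
min=$2

for grp in test1 test2; do
        time=$(grep "^$maj:$min " /cgroup/bfqio/$grp/io.disk_time | awk '{print $2}')
        sectors=$(grep "^$maj:$min " /cgroup/bfqio/$grp/io.disk_sectors | awk '{print $2}')
        echo "$grp statistics: time=$maj:$min $time sectors=$maj:$min $sectors"
done
-------------------------------------------------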
AIO writes
----------
Set up two fio AIO direct write jobs in two cgroups with weights 1000 and 500
respectively. I am using the cfq scheduler. Following are some lines from my
test script.

------------------------------------------------
echo 1000 > /cgroup/bfqio/test1/io.weight
echo 500 > /cgroup/bfqio/test2/io.weight
fio_args="--ioengine=libaio --rw=write --size=512M --direct=1"

echo 1 > /sys/block/$BLOCKDEV/queue/iosched/fairness

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/$BLOCKDEV/fio1/ \
--output=/mnt/$BLOCKDEV/fio1/test1.log \
--exec_postrun="../read-and-display-group-stats.sh $maj_dev $minor_dev" &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/$BLOCKDEV/fio2/ \
--output=/mnt/$BLOCKDEV/fio2/test2.log &
-------------------------------------------------

test1 and test2 are two groups with weights 1000 and 500 respectively.
"read-and-display-group-stats.sh" is a small script which reads the test1 and
test2 cgroup files to determine how much disk time each group got by the time
the first fio job finished.

Following are the results.

test1 statistics: time=8:16 25509 sectors=8:16 1049688
test2 statistics: time=8:16 12863 sectors=8:16 527104

The above shows that by the time the first fio (higher weight) finished,
group test1 got almost double the disk time of group test2.

Test5 (Fairness for async writes, Buffered Write Vs Buffered Write)
===================================================================
Fairness for async writes is tricky, and the biggest reason is that async
writes are cached in higher layers (page cache), and possibly in the file
system layer as well (btrfs, xfs, etc.), and are not necessarily dispatched
to the lower layers in a proportional manner.

For example, consider two dd threads reading /dev/zero as the input file and
writing huge files. Very soon we will cross vm_dirty_ratio, and the dd
threads will be forced to write out some pages to disk before more pages can
be dirtied. But it is not necessarily dirty pages of the same thread that get
picked. It can very well pick the inode of the lower priority dd thread and
do some writeout. So effectively the higher weight dd is doing writeouts of
the lower weight dd's pages and we don't see service differentiation.

IOW, the core problem with async write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep its
queue continuously backlogged. In my testing, there are many 0.2 to 0.8
second intervals where the higher weight queue is empty, and in that duration
the lower weight queue gets lots of work done, giving the impression that
there was no service differentiation.

In summary, from the IO controller's point of view, async write support is
there. Because the page cache has not been designed in such a manner that a
higher prio/weight writer can do more writeout than a lower prio/weight
writer, getting service differentiation is hard, and it is visible in some
cases and not in others.

Do we really care that much about fairness between two writer cgroups? One
can choose to do direct writes or sync writes if fairness for writes really
matters to him (see the short example below).

Following is the only case where it is hard to ensure fairness between
cgroups.

- Buffered writes Vs Buffered Writes.
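As an aside on the direct write option mentioned above: writers that bypass
the page cache with O_DIRECT are seen by the IO controller directly and get
per-group service. A purely illustrative sketch, not one of the tests in this
mail (file names are made up):

-------------------------------------------------
echo $$ > /cgroup/bfqio/test1/tasks
dd if=/dev/zero of=/mnt/sdd1/direct-writer1 bs=1M count=512 oflag=direct &

echo $$ > /cgroup/bfqio/test2/tasks
dd if=/dev/zero of=/mnt/sdd2/direct-writer2 bs=1M count=512 oflag=direct &
-------------------------------------------------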
So to test async writes I created two partitions on a disk and created ext3
file systems on both partitions. Also created two cgroups and generated lots
of write traffic in the two cgroups (50 fio threads) and watched the disk
time statistics in the respective cgroups at an interval of 2 seconds. Thanks
to Ryo Tsuruta for the test case.

*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches

fio_args="--size=64m --rw=write --numjobs=50 --group_reporting"

echo $$ > /cgroup/bfqio/test1/tasks
fio $fio_args --name=test1 --directory=/mnt/sdd1/fio/ --output=/mnt/sdd1/fio/test1.log &

echo $$ > /cgroup/bfqio/test2/tasks
fio $fio_args --name=test2 --directory=/mnt/sdd2/fio/ --output=/mnt/sdd2/fio/test2.log &
***********************************************************************

I watched the disk time and sector statistics for both cgroups every 2
seconds using a script. Here is a snippet from the output.

test1 statistics: time=8:16 1631 sectors=8:16 1680 dq=8:16 2
test2 statistics: time=8:16 896 sectors=8:16 976 dq=8:16 1

test1 statistics: time=8:16 6031 sectors=8:16 88536 dq=8:16 5
test2 statistics: time=8:16 3192 sectors=8:16 4080 dq=8:16 1

test1 statistics: time=8:16 10425 sectors=8:16 390496 dq=8:16 5
test2 statistics: time=8:16 5272 sectors=8:16 77896 dq=8:16 4

test1 statistics: time=8:16 15396 sectors=8:16 747256 dq=8:16 5
test2 statistics: time=8:16 7852 sectors=8:16 235648 dq=8:16 4

test1 statistics: time=8:16 20302 sectors=8:16 1180168 dq=8:16 5
test2 statistics: time=8:16 10297 sectors=8:16 391208 dq=8:16 4

test1 statistics: time=8:16 25244 sectors=8:16 1579928 dq=8:16 6
test2 statistics: time=8:16 12748 sectors=8:16 613096 dq=8:16 4

test1 statistics: time=8:16 30095 sectors=8:16 1927848 dq=8:16 6
test2 statistics: time=8:16 15135 sectors=8:16 806112 dq=8:16 4

The first two fields in the time and sectors statistics represent the major
and minor number of the device. The third field represents the disk time in
milliseconds and the number of sectors transferred respectively.

So the disk time consumed by group test1 is almost double that of group test2
in this case as well.

Thanks
Vivek