On Mon, Feb 21, 2011 at 03:36:14PM +0800, Gui Jianfeng wrote:
> Dominik,
>
> Would you try "oflag=direct" when you do tests in the guests? And make sure
> /sys/block/xxx/queue/iosched/group_isolation is set to 1.

oflag=direct in the guest might be good for testing and understanding the
problem, but in practice we will not have control over what a user is
running inside the guest. The only control we will have is to use cache=none
for the guest and then control any traffic coming out of the guest.

Thanks
Vivek

>
> I guess with such a setting, your tests should go well.
>
> Thanks,
> Gui
>
> Vivek Goyal wrote:
> > On Fri, Feb 18, 2011 at 03:42:45PM +0100, Dominik Klein wrote:
> >> Hi Vivek
> >>
> >> I don't know whether you follow the libvirt list; I assume you don't. So
> >> I thought I'd forward you an e-mail involving the blkio controller and a
> >> terrible situation arising from using it (maybe in a wrong way).
> >>
> >> I'd truly appreciate it if you read it and commented on it. Maybe I did
> >> something wrong, but maybe I also found a bug of some sort.
> >
> > Hi Dominik,
> >
> > Thanks for forwarding me this mail. Yes, I am not on libvir-list. I have
> > just now subscribed.
> >
> > A few questions inline.
> >
> >> -------- Original Message --------
> >> Subject: Re: [PATCH 0/6 v3] Add blkio cgroup support
> >> Date: Fri, 18 Feb 2011 14:42:51 +0100
> >> From: Dominik Klein <dk@xxxxxxxxxxxxxxxx>
> >> To: libvir-list@xxxxxxxxxx
> >>
> >> Hi
> >>
> >> back with some testing results.
> >>
> >>>> how about starting the guest with the option "cache=none" to bypass the
> >>>> pagecache? This should help, I think.
> >>> I will read up on where to set that and give it a try. Thanks for the hint.
> >>
> >> So here's what I did and found out:
> >>
> >> The host system has two 12-core CPUs and 128 GB of RAM.
> >>
> >> I have 8 test VMs named kernel1 to kernel8. Each VM has 4 VCPUs, 2 GB of
> >> RAM and one disk, which is an LV on the host. Cache mode is "none":
> >
> > So you have only one root SATA disk and set up a linear logical volume on
> > that? If not, can you give more info about the storage configuration?
> >
> > - I am assuming you are using CFQ on your underlying physical disk.
> >
> > - What kernel version are you testing with?
> >
> > - Cache=none mode is good, which should make all the IO O_DIRECT on the
> >   host, and it should show up as SYNC IO in CFQ without losing io context
> >   info. The only problem is the intermediate dm layer and whether it is
> >   changing the io context somehow. I am not sure at this point.
> >
> > - Is it possible to capture a 10-15 second blktrace on your underlying
> >   physical device? That should give me some idea of what's happening.
> >
> > - Can you also try setting /sys/block/<disk>/queue/iosched/group_isolation=1
> >   on your underlying physical device where CFQ is running and see if it
> >   makes any difference (a rough command sketch for this and the blktrace
> >   capture follows below).
> >
> >> for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do virsh dumpxml $vm | grep cache; done
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >> <driver name='qemu' type='raw' cache='none'/>
> >>
> >> My goal is to give more I/O time to kernel1 and kernel2 than to the rest
> >> of the VMs.
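For reference, the group_isolation and blktrace requests above could be
carried out on the host roughly as follows. This is only a sketch and was not
part of the tests below; "sdb" is just a stand-in for whatever physical disk
actually backs the logical volumes:

  cat /sys/block/sdb/queue/scheduler                    # should show [cfq]
  echo 1 > /sys/block/sdb/queue/iosched/group_isolation
  blktrace -d /dev/sdb -w 15                            # ~15 second trace
  blkparse -i sdb | less                                # inspect the trace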
> >>
> >> mount -t cgroup -o blkio none /mnt
> >> cd /mnt
> >> mkdir important
> >> mkdir notimportant
> >>
> >> echo 1000 > important/blkio.weight
> >> echo 100 > notimportant/blkio.weight
> >>
> >> for vm in kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do
> >>   cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> >>   for task in *; do
> >>     /bin/echo $task > /mnt/notimportant/tasks
> >>   done
> >> done
> >>
> >> for vm in kernel1 kernel2; do
> >>   cd /proc/$(pgrep -f "qemu-kvm.*$vm")/task
> >>   for task in *; do
> >>     /bin/echo $task > /mnt/important/tasks
> >>   done
> >> done
> >>
> >> Then I used cssh to connect to all 8 VMs and execute
> >>   dd if=/dev/zero of=testfile bs=1M count=1500
> >> in all VMs simultaneously.
> >>
> >> Results are:
> >> kernel1: 47.5593 s, 33.1 MB/s
> >> kernel2: 60.1464 s, 26.2 MB/s
> >> kernel3: 74.204 s, 21.2 MB/s
> >> kernel4: 77.0759 s, 20.4 MB/s
> >> kernel5: 65.6309 s, 24.0 MB/s
> >> kernel6: 81.1402 s, 19.4 MB/s
> >> kernel7: 70.3881 s, 22.3 MB/s
> >> kernel8: 77.4475 s, 20.3 MB/s
> >>
> >> Results vary a little bit from run to run, but nothing as spectacular as
> >> weights of 1000 vs. 100 would suggest.
> >>
> >> So I went and tried to throttle I/O of kernel3-8 to 10 MB/s instead of
> >> weighting I/O. First I rebooted everything so that no old cgroup
> >> configuration was left in place, and then set up everything except the
> >> 100 and 1000 weight configuration.
> >>
> >> Quote from blkio.txt:
> >> ------------
> >> - blkio.throttle.write_bps_device
> >>   - Specifies upper limit on WRITE rate to the device. IO rate is
> >>     specified in bytes per second. Rules are per device. Following is
> >>     the format.
> >>
> >>     echo "<major>:<minor> <rate_bytes_per_second>" >
> >>       /cgrp/blkio.throttle.write_bps_device
> >> -------------
> >>
> >> for vm in kernel1 kernel2 kernel3 kernel4 kernel5 kernel6 kernel7 kernel8; do ls -lH /dev/vdisks/$vm; done
> >> brw-rw---- 1 root root 254, 23 Feb 18 13:45 /dev/vdisks/kernel1
> >> brw-rw---- 1 root root 254, 24 Feb 18 13:45 /dev/vdisks/kernel2
> >> brw-rw---- 1 root root 254, 25 Feb 18 13:45 /dev/vdisks/kernel3
> >> brw-rw---- 1 root root 254, 26 Feb 18 13:45 /dev/vdisks/kernel4
> >> brw-rw---- 1 root root 254, 27 Feb 18 13:45 /dev/vdisks/kernel5
> >> brw-rw---- 1 root root 254, 28 Feb 18 13:45 /dev/vdisks/kernel6
> >> brw-rw---- 1 root root 254, 29 Feb 18 13:45 /dev/vdisks/kernel7
> >> brw-rw---- 1 root root 254, 30 Feb 18 13:45 /dev/vdisks/kernel8
> >>
> >> /bin/echo 254:25 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >> /bin/echo 254:26 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >> /bin/echo 254:27 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >> /bin/echo 254:28 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >> /bin/echo 254:29 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >> /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >> /bin/echo 254:30 10000000 > /mnt/notimportant/blkio.throttle.write_bps_device
> >>
> >> Then I ran the previous test again. This resulted in an ever-increasing
> >> load (last I checked was ~ 300) on the host system. (This is perfectly
> >> reproducible.)
> >>
> >> uptime
> >> Fri Feb 18 14:42:17 2011
> >> 14:42:17 up 12 min, 9 users, load average: 286.51, 142.22, 56.71
> >
> > Have you run top or something to figure out why the load average is
> > shooting up? I suspect that because of the throttling limit, IO threads
> > have been blocked and qemu is forking more IO threads. Can you just run
> > top/ps and figure out what's happening?
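A quick way to gather that information might look like the following; this is
only a sketch, and it assumes the qemu processes are named qemu-kvm as in the
pgrep commands above:

  ps -eLf | grep -c '[q]emu-kvm'                 # rough count of qemu-kvm threads
  ps -eo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'  # tasks in uninterruptible sleep
  vmstat 1 5                                     # 'b' column = blocked tasks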
> > Again, is it some kind of linear volume group from which you have carved
> > out logical volumes for each virtual machine?
> >
> > For throttling, to begin with, can we do a simple test first? That is,
> > run a single virtual machine, put some throttling limit on its logical
> > volume and try to do READs. Once READs work, let's test WRITES and check
> > why the system load goes up.
> >
> > Thanks
> > Vivek
> >
> > --
> > libvir-list mailing list
> > libvir-list@xxxxxxxxxx
> > https://www.redhat.com/mailman/listinfo/libvir-list
> >
>
> --
> Regards
> Gui Jianfeng

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list
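For reference, the simple single-VM test proposed above might look roughly
like this; it is only a sketch, the 10 MB/s limit and the 254:23 device number
are just the values from the earlier setup, and it assumes the VM's qemu-kvm
tasks have been moved into the cgroup that carries the limit:

  # host: throttle READs from kernel1's LV (254:23 per the listing above)
  /bin/echo "254:23 10000000" > /mnt/notimportant/blkio.throttle.read_bps_device

  # guest: read the test file back with direct I/O
  dd if=testfile of=/dev/null bs=1M iflag=direct

  # if READ throttling behaves, repeat with blkio.throttle.write_bps_device
  # and a write test (dd if=/dev/zero of=testfile bs=1M count=1500 oflag=direct)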