FW: cgroup blkio.weight working, but not for KVM guests

"Ben Clay" <rbclay@xxxxxxxx> · Mon, 22 Oct 2012 07:36:34 -0600

Forwarding this to the KVM general list.  I doubt you folks can help me with
libvirt, but I was wondering if there's some way to verify whether the
cache=none parameter is being respected for my KVM guest's disk image, or if
there are any other configuration/debug steps appropriate for KVM + virtio +
cgroup.

Thanks.

Ben Clay
rbclay@xxxxxxxx

From: Ben Clay [mailto:rbclay@xxxxxxxx] 
Sent: Wednesday, October 17, 2012 11:31 AM
To: libvirt-users@xxxxxxxxxx
Subject: cgroup blkio.weight working, but not for KVM guests

I'm running libvirt 0.10.2 and qemu-kvm-1.2.0, both compiled from source, on
CentOS 6.  I've got a working blkio cgroup hierarchy which I'm attaching
guests to using the following XML guest configs:

VM1 (foreground):

  <cputune>
    <shares>2048</shares>
  </cputune>
  <blkiotune>
    <weight>1000</weight>
  </blkiotune>

VM2 (background): 

  <cputune>
    <shares>2</shares>
  </cputune>
  <blkiotune>
    <weight>100</weight>
  </blkiotune>

I've tested write throughput on the host using cgexec and dd, demonstrating
that libvirt has correctly set up the cgroups:

cgexec -g blkio:libvirt/qemu/foreground time dd if=/dev/zero of=trash1.img \
    oflag=direct bs=1M count=4096 &
cgexec -g blkio:libvirt/qemu/background time dd if=/dev/zero of=trash2.img \
    oflag=direct bs=1M count=4096 &

Snap from iotop, showing an 8:1 ratio (should be 10:1, but 8:1 is
acceptable):

Total DISK READ: 0.00 B/s | Total DISK WRITE: 91.52 M/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 9602 be/4 root        0.00 B/s   10.71 M/s  0.00 % 98.54 % dd if=/dev/zero of=trash2.img oflag=direct bs=1M count=4096
 9601 be/4 root        0.00 B/s   80.81 M/s  0.00 % 97.76 % dd if=/dev/zero of=trash1.img oflag=direct bs=1M count=4096

Further, checking the task list inside each cgroup shows the guest's main
PID, plus those of the virtio kernel threads.  It's hard to tell if all the
virtio kernel threads are listed, but all the ones I've hunted down appear
to be there.
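
A way to make that check less manual is to print a name next to every task
ID in a cgroup's tasks file. This is only a sketch: list_cgroup_tasks is a
made-up helper name, and the /cgroup/blkio mount point in the usage line is
an assumption that may not match your hierarchy.

```shell
#!/bin/bash
# list_cgroup_tasks is a hypothetical helper, not part of libcgroup or libvirt.
# It prints "<tid> <name>" for every task ID in <cgroup_dir>/tasks, taking the
# name from the "Name:" field of /proc/<tid>/status.
list_cgroup_tasks() {
    local dir=$1 tid name
    while read -r tid; do
        [ -n "$tid" ] || continue
        # A task may exit between reading the list and looking it up.
        name=$(awk '/^Name:/ { sub(/^Name:[ \t]*/, ""); print; exit }' \
            "/proc/$tid/status" 2>/dev/null)
        echo "$tid ${name:-<gone>}"
    done < "$dir/tasks"
}

# Example (mount point is an assumption; adjust to your cgroup hierarchy):
#   list_cgroup_tasks /cgroup/blkio/libvirt/qemu/foreground
```

Comparing that output against the threads of the qemu-kvm process (for
example from ps -eLf) should show whether any I/O-submitting thread sits
outside the cgroup.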

However, when running the same dd commands inside the guests, I get roughly
equal performance, nowhere near the ~8:1 relative bandwidth enforcement I
get from the host (the background dd was ctrl-c'd right after the foreground
one finished; both were started within 1s of each other):

[ben@foreground ~]$ dd if=/dev/zero of=trash1.img oflag=direct bs=1M count=4096
4096+0 records in
4096+0 records out
4294967296 bytes (4.3 GB) copied, 104.645 s, 41.0 MB/s

[ben@background ~]$ dd if=/dev/zero of=trash2.img oflag=direct bs=1M count=4096
^C4052+0 records in
4052+0 records out
4248829952 bytes (4.2 GB) copied, 106.318 s, 40.0 MB/s
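
For reference, the ratios behind these numbers can be worked out with a
one-line awk helper. The function name ratio is made up for illustration;
the inputs are the configured weights and the throughput figures quoted
above.

```shell
#!/bin/bash
# ratio <a> <b>: prints a/b to one decimal place (hypothetical helper).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

ratio 1000 100     # configured blkio weights
ratio 80.81 10.71  # host-side dd throughput from iotop (M/s)
ratio 41.0 40.0    # guest-side dd throughput (MB/s)
```

This prints 10.0, 7.5 and 1.0: the host-side test lands close to the
configured weights, while the guests are effectively unweighted.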

Based on this statement from the Red Hat Resource Management Guide:

"Currently, the Block I/O subsystem does not work for buffered write
operations. It is primarily targeted at direct I/O, although it works for
buffered read operations."

(https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/ch-Subsystems_and_Tunable_Parameters.html)

I thought this problem might be due to host-side buffering, but I have that
explicitly disabled in my guest configs:

  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type="file" device="disk">
      <driver name="qemu" type="raw" cache="none"/>
      <source file="/path/to/disk.img"/>
      <target dev="vda" bus="virtio"/>
      <alias name="virtio-disk0"/>
      <address type="pci" domain="0x0000" bus="0x00" slot="0x04"
function="0x0"/>
    </disk>

Here is the qemu line from ps, showing that cache=none is clearly being
passed through from the guest XML config:

root      5110 20.8  4.3 4491352 349312 ?      Sl   11:58   0:38
/usr/bin/qemu-kvm -name background -S -M pc-1.2 -enable-kvm -m 2048 -smp
2,sockets=2,cores=1,threads=1 -uuid ea632741-c7be-36ab-bd69-da3cbe505b38
-no-user-config -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/background.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc
-no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive
file=/path/to/disk.img,if=none,id=drive-virtio-disk0,format=raw,cache=none
-device
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=20,id=hostnet0,vhost=on,vhostfd=22
-device
virtio-net-pci,netdev=hostnet0,id=net0,mac=00:11:22:33:44:55,bus=pci.0,addr=0x3
-chardev pty,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc
127.0.0.1:1 -vga cirrus -device
virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

For fun I tried a few different cache options to try to force a bypass of
the host buffer cache, including writethrough and directsync, but the number
of virtio kernel threads appeared to explode (especially for directsync) and
the throughput dropped quite low: ~50% of "none" for writethrough and ~5%
for directsync.

With cache=none, when I generate write loads inside the VMs, I do see growth
in the host's buffer cache.  Further, if I use non-direct I/O inside the
VMs, and inflate the balloon (forcing the guest's buffer cache to flush), I
don't see a corresponding drop in background throughput.  Is it possible
that the cache="none" directive is not being respected?
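
One host-side check that might answer this is to look at the open flags of
the qemu process's descriptor for the disk image: with cache=none the image
should be opened with O_DIRECT, and /proc/<pid>/fdinfo/<fd> reports the flags
in octal. The sketch below assumes an x86 or x86-64 host, where O_DIRECT is
040000; the helper names has_o_direct and check_image_fds are made up for
illustration.

```shell
#!/bin/bash
# Hypothetical helpers for illustration; not part of qemu or libvirt.

# has_o_direct <octal flags>: prints "yes" if the O_DIRECT bit is set.
# 040000 is the O_DIRECT value on x86 and x86-64 Linux; other
# architectures use a different value.
has_o_direct() {
    if [ $(( 0$1 & 040000 )) -ne 0 ]; then
        echo yes
    else
        echo no
    fi
}

# check_image_fds <qemu pid> <image path>: for every descriptor of the
# process that points at the image, report whether it is open with O_DIRECT.
check_image_fds() {
    local pid=$1 image=$2 fd flags
    for fd in /proc/"$pid"/fd/*; do
        [ "$(readlink "$fd")" = "$image" ] || continue
        flags=$(awk '/^flags:/ { print $2 }' "/proc/$pid/fdinfo/${fd##*/}")
        echo "fd ${fd##*/}: flags=$flags O_DIRECT=$(has_o_direct "$flags")"
    done
}

# Example, run as root on the host with the PID from the ps output:
#   check_image_fds 5110 /path/to/disk.img
```

If the image descriptor reports O_DIRECT=no, the cache setting is not taking
effect at open time; if it reports yes, the host buffer cache growth has to
be coming from somewhere else.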

Since cgroups is working for host-side processes I think my blkio subsystem
is correctly set up (using cfq, group_isolation=1 etc).  Maybe I miscompiled
qemu, without some needed direct I/O support?  Has anyone seen this before?
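
For anyone wanting to compare settings, the scheduler side can be dumped with
something like the sketch below. The helper names are made up, "sda" is an
assumed device name, and the group_isolation path is where I would expect the
CFQ tunable on a CentOS 6 kernel, so the script only reads it if it exists.

```shell
#!/bin/bash
# active_scheduler <line>: extracts the bracketed (active) entry from the
# contents of /sys/block/<dev>/queue/scheduler, e.g. "noop deadline [cfq]".
active_scheduler() {
    echo "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# show_blkio_setup <dev>: prints the active scheduler and, if the file
# exists, the CFQ group_isolation setting for a block device.
show_blkio_setup() {
    local q=/sys/block/$1/queue
    echo "scheduler: $(active_scheduler "$(cat "$q/scheduler")")"
    if [ -r "$q/iosched/group_isolation" ]; then
        echo "group_isolation: $(cat "$q/iosched/group_isolation")"
    fi
}

# Example ("sda" is an assumption; use the device backing the image):
#   show_blkio_setup sda
```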

Ben Clay
rbclay@xxxxxxxx
