On Thu, Dec 9, 2010 at 12:52 PM, Sebastian Nickel - Hetzner Online AG
<sebastian.nickel@xxxxxxxxxx> wrote:
> here is the qemu command line we are using (or which libvirt generates):
>
> /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 512 -smp
> 1,sockets=1,cores=1,threads=1 -name vm-933 -uuid
> 0d737610-e59b-012d-f453-32287f7402ab -nodefaults -chardev
> socket,id=monitor,path=/var/lib/libvirt/qemu/vm-933.monitor,server,nowait
> -mon chardev=monitor,mode=readline -rtc base=utc -boot nc -drive
> file=/dev/vg0/934,if=none,id=drive-ide0-0-0,boot=on,format=raw,cache=writeback
> -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
> -device rtl8139,vlan=0,id=net0,mac=00:1c:14:01:03:67,bus=pci.0,addr=0x3
> -net tap,fd=23,vlan=0,name=hostnet0 -chardev pty,id=serial0 -device
> isa-serial,chardev=serial0 -usb -vga cirrus -device
> virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4
>
> We are explicitly using "writeback" in the cache settings.
>
>> The first backtrace is odd. You are using logical volumes for the
>> guest but the backtrace shows kjournald is blocked. I believe logical
>> volumes should not directly affect kjournald at all (they don't use
>> journalling). Perhaps this is a deadlock.
> The dmesg output was just an example. Most of the time I see the tasks
> "kjournald" and "flush", but I have recently also seen "kvm",
> "kthreadd", "rsyslogd", and others in such outputs. My suspicion is
> that sometimes /proc/sys/vm/dirty_ratio is exceeded, so that all
> processes (the kvm processes too) are blocked on writes to the cache.
> Could this be the case? I set "dirty_background_ratio" to 5 to flush
> the cache to disk continuously, but this did not help.
>
>> About the "flush-251:0:505" hang, please cat /proc/partitions on the
>> host to see which block device has major number 251 and minor number 0.
> This is the logical volume holding the "root" partition of the
> physical host.
>
>> The fact that your host is having problems suggests the issue is not
>> in qemu-kvm (it's just a userspace process).
>> Are you sure disk I/O is working under load on this machine without
>> KVM?
> I do not think that kvm itself generates this issue (as you said, it
> is a normal user space process). I thought that perhaps somebody knows
> how to handle this situation, because the kvm developers have much
> more experience with kvm than I do. Perhaps there are some tuning tips
> for this, or somebody knows why only OpenSuse sets the filesystem
> read-only when there are disk timeouts in the guest? This behavior
> appeared on almost all hosts (>20), so I can rule out a hardware
> failure on a single machine.

Christoph, any pointers on how to debug this? The backtraces from the
original email are below:

> On Thursday, 09.12.2010, at 10:30 +0000, Stefan Hajnoczi wrote:
>> On Thu, Dec 9, 2010 at 8:10 AM, Sebastian Nickel - Hetzner Online AG
>> <sebastian.nickel@xxxxxxxxxx> wrote:
>> > Hello,
>> > we have got some issues with I/O in our kvm environment. We are
>> > using kernel version 2.6.32 (Ubuntu 10.04 LTS) to virtualise our
>> > hosts, and we are using ksm, too. Recently we noticed that
>> > sometimes the guest systems (mainly OpenSuse guests) suddenly have
>> > a read-only filesystem. After some inspection we found out that
>> > the guest generates ATA errors due to timeouts (mostly in "flush
>> > cache" situations). On the physical host there are always the same
>> > kernel messages when this happens:
>> >
>> > """
>> > [1508127.195469] INFO: task kjournald:497 blocked for more than 120
>> > seconds.
>> > [1508127.212828] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [1508127.246841] kjournald     D 00000000ffffffff     0   497     2 0x00000000
>> > [1508127.246848] ffff88062128dba0 0000000000000046 0000000000015bc0 0000000000015bc0
>> > [1508127.246855] ffff880621089ab0 ffff88062128dfd8 0000000000015bc0 ffff8806210896f0
>> > [1508127.246862] 0000000000015bc0 ffff88062128dfd8 0000000000015bc0 ffff880621089ab0
>> > [1508127.246868] Call Trace:
>> > [1508127.246880] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50
>> > [1508127.246889] [<ffffffff81557d87>] io_schedule+0x47/0x70
>> > [1508127.246893] [<ffffffff8116e545>] sync_buffer+0x45/0x50
>> > [1508127.246897] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0
>> > [1508127.246901] [<ffffffff8116e500>] ? sync_buffer+0x0/0x50
>> > [1508127.246905] [<ffffffff81558338>] out_of_line_wait_on_bit_lock+0x78/0x90
>> > [1508127.246911] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
>> > [1508127.246915] [<ffffffff8116e6c6>] __lock_buffer+0x36/0x40
>> > [1508127.246920] [<ffffffff81213d11>] journal_submit_data_buffers+0x311/0x320
>> > [1508127.246924] [<ffffffff81213ff2>] journal_commit_transaction+0x2d2/0xe40
>> > [1508127.246931] [<ffffffff810397a9>] ? default_spin_lock_flags+0x9/0x10
>> > [1508127.246935] [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70
>> > [1508127.246939] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
>> > [1508127.246943] [<ffffffff81217f0d>] kjournald+0xed/0x250
>> > [1508127.246947] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
>> > [1508127.246951] [<ffffffff81217e20>] ? kjournald+0x0/0x250
>> > [1508127.246954] [<ffffffff81084d16>] kthread+0x96/0xa0
>> > [1508127.246959] [<ffffffff810141ea>] child_rip+0xa/0x20
>> > [1508127.246962] [<ffffffff81084c80>] ? kthread+0x0/0xa0
>> > [1508127.246966] [<ffffffff810141e0>] ? child_rip+0x0/0x20
>> > [1508127.246969] INFO: task flush-251:0:505 blocked for more than 120
>> > seconds.
>> > [1508127.264076] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>> > disables this message.
>> > [1508127.298343] flush-251:0   D ffffffff810f4370     0   505     2 0x00000000
>> > [1508127.298349] ffff880621fdba30 0000000000000046 0000000000015bc0 0000000000015bc0
>> > [1508127.298354] ffff88062108b1a0 ffff880621fdbfd8 0000000000015bc0 ffff88062108ade0
>> > [1508127.298358] 0000000000015bc0 ffff880621fdbfd8 0000000000015bc0 ffff88062108b1a0
>> > [1508127.298362] Call Trace:
>> > [1508127.298370] [<ffffffff810f4370>] ? sync_page+0x0/0x50
>> > [1508127.298375] [<ffffffff81557d87>] io_schedule+0x47/0x70
>> > [1508127.298379] [<ffffffff810f43ad>] sync_page+0x3d/0x50
>> > [1508127.298383] [<ffffffff8155825a>] __wait_on_bit_lock+0x5a/0xc0
>> > [1508127.298391] [<ffffffff810f4347>] __lock_page+0x67/0x70
>> > [1508127.298395] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
>> > [1508127.298402] [<ffffffff810f4487>] ? unlock_page+0x27/0x30
>> > [1508127.298410] [<ffffffff810fd9dd>] write_cache_pages+0x3bd/0x4d0
>> > [1508127.298417] [<ffffffff810fc670>] ? __writepage+0x0/0x40
>> > [1508127.298425] [<ffffffff810fdb14>] generic_writepages+0x24/0x30
>> > [1508127.298432] [<ffffffff810fdb55>] do_writepages+0x35/0x40
>> > [1508127.298439] [<ffffffff811668a6>] writeback_single_inode+0xf6/0x3d0
>> > [1508127.298449] [<ffffffff812b81d6>] ? rb_erase+0xd6/0x160
>> > [1508127.298455] [<ffffffff8116750e>] writeback_inodes_wb+0x40e/0x5e0
>> > [1508127.298462] [<ffffffff811677ea>] wb_writeback+0x10a/0x1d0
>> > [1508127.298469] [<ffffffff81077719>] ? try_to_del_timer_sync+0x79/0xd0
>> > [1508127.298477] [<ffffffff8155803d>] ? schedule_timeout+0x19d/0x300
>> > [1508127.298485] [<ffffffff81167b1c>] wb_do_writeback+0x18c/0x1a0
>> > [1508127.298493] [<ffffffff81167b83>] bdi_writeback_task+0x53/0xe0
>> > [1508127.298503] [<ffffffff8110f546>] bdi_start_fn+0x86/0x100
>> > [1508127.298510] [<ffffffff8110f4c0>] ? bdi_start_fn+0x0/0x100
>> > [1508127.298518] [<ffffffff81084d16>] kthread+0x96/0xa0
>> > [1508127.298526] [<ffffffff810141ea>] child_rip+0xa/0x20
>> > [1508127.298533] [<ffffffff81084c80>] ? kthread+0x0/0xa0
>> > [1508127.298541] [<ffffffff810141e0>] ? child_rip+0x0/0x20
>> > ... some more messages like those before ...
>> > """
>> > Sometimes the message "INFO: task kvm blocked for more than 120
>> > seconds" appears, too. I thought the error happens in cache
>> > writeback situations, so I started to adjust
>> > "/proc/sys/vm/dirty_background_ratio" to 5 and
>> > "/proc/sys/vm/dirty_ratio" to 40, expecting this to continuously
>> > write smaller parts of the cached memory to the HDD (more often,
>> > but in smaller chunks). This did not help; there are still
>> > "read-only" filesystems in the guest systems. Does anybody have
>> > some tips for regulating I/O on Linux systems, or for stopping
>> > those "read-only" filesystems?
>> [...]
>> > We are using logical volumes as virtual guest HDDs. The volume
>> > group is on top of a mdraid device.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
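For reference, Stefan's "/proc/partitions" suggestion for identifying the device behind the "flush-251:0" thread can be scripted. A minimal sketch; the `lookup_dev` helper name and its optional file argument are ours for illustration, not anything from the thread:

```shell
# lookup_dev: print the device name for a given major:minor pair by
# scanning a /proc/partitions-style table (columns: major minor #blocks name).
lookup_dev() {
    awk -v maj="$1" -v min="$2" \
        '$1 == maj && $2 == min { print $4 }' "${3:-/proc/partitions}"
}

# On the affected host one would run:  lookup_dev 251 0
# For device-mapper (LVM) devices, minors can also be mapped to LV names
# with:  dmsetup ls
```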
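Stefan's question about whether disk I/O works under load without KVM can be answered with a simple host-side write test. A sketch under stated assumptions: the target path is an example and should point at a dedicated scratch area, never at a volume a guest is using:

```shell
# TARGET is an example path; on the affected host, point it at a
# dedicated scratch LV, never at a volume that a guest is using.
TARGET=${TARGET:-/tmp/io-test.img}

# Write 64 MiB and force it to stable storage (conv=fsync), mimicking
# the guest's "flush cache" pattern.  On a real block device, add
# oflag=direct as well, so the host page cache is bypassed and the
# md/LVM stack underneath is measured directly.
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fsync
```

If the same hung-task messages appear during such a run, the problem is in the host storage stack (mdraid/LVM/disk), not in qemu-kvm.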
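Since the ratio-based knobs did not help, the byte-based equivalents (available since 2.6.29, so usable on these 2.6.32 hosts) give more predictable writeback limits on large-memory machines. A configuration sketch; the values are illustrative only, require root, and do not persist across reboots:

```shell
# Illustrative values, not a recommendation for this workload: cap the
# dirty page cache at ~256 MiB and start background writeback at ~64 MiB.
# Writing the *_bytes knobs automatically zeroes the *_ratio equivalents.
sysctl -w vm.dirty_bytes=268435456
sysctl -w vm.dirty_background_bytes=67108864

# The current writeback settings can be inspected with:
grep . /proc/sys/vm/dirty_*
```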