On Mon, Feb 21, 2011 at 09:30:32AM +0100, Dominik Klein wrote:
> On 02/21/2011 09:19 AM, Dominik Klein wrote:
> >>> - Is it possible to capture 10-15 second blktrace on your underlying
> >>> physical device. That should give me some idea what's happening.
> >>
> >> Will do, read on.
> >
> > Just realized I missed this one ... Had better done it right away.
> >
> > So here goes.
> >
> > Setup as in first email. 8 machines, 2 important, 6 not important ones
> > with a throttle of ~10M. group_isolation=1. Each VM dd'ing zeroes.
> >
> > blktrace -d /dev/sdb -w 30
> > === sdb ===
> >  CPU  0:   4769 events,  224 KiB data
> >  CPU  1:  28079 events, 1317 KiB data
> >  CPU  2:   1179 events,   56 KiB data
> >  CPU  3:   5529 events,  260 KiB data
> >  CPU  4:    295 events,   14 KiB data
> >  CPU  5:    649 events,   31 KiB data
> >  CPU  6:    185 events,    9 KiB data
> >  CPU  7:    180 events,    9 KiB data
> >  CPU  8:     17 events,    1 KiB data
> >  CPU  9:     12 events,    1 KiB data
> >  CPU 10:      6 events,    1 KiB data
> >  CPU 11:     55 events,    3 KiB data
> >  CPU 12:  28005 events, 1313 KiB data
> >  CPU 13:   1542 events,   73 KiB data
> >  CPU 14:   4814 events,  226 KiB data
> >  CPU 15:    389 events,   19 KiB data
> >  CPU 16:   1545 events,   73 KiB data
> >  CPU 17:    119 events,    6 KiB data
> >  CPU 18:   3019 events,  142 KiB data
> >  CPU 19:     62 events,    3 KiB data
> >  CPU 20:    800 events,   38 KiB data
> >  CPU 21:     17 events,    1 KiB data
> >  CPU 22:    243 events,   12 KiB data
> >  CPU 23:      1 events,    1 KiB data
> >  Total:  81511 events (dropped 0), 3822 KiB data
> >
> > Very constant 296 blocked processes in vmstat during this run. But...
> > apparently no data is written at all (see "bo" column).

Hm, this sounds bad. If you have put a limit of ~10MB/s, then no "bo" at
all is bad. That would explain why your box is not responding and you
need to do a power reset.

- I am assuming that you have not put any throttling limits on the root
  group. Is your system root also on /dev/sdb, or on a separate disk
  altogether?

- This sounds like a bug in the throttling logic. To narrow it down, can
  you try running "deadline" on the end device (see the sketch after
  this list)? If it still happens, the problem is more or less in the
  throttling layer.

- We can also try to take the dm layer out of the picture: just create
  partitions on /dev/sdb, export them as virtio disks to the virtual
  machines, and see if it still happens.

- In one of the mails you mentioned that with 1 virtual machine,
  throttling READs and WRITEs works for you. So it looks like 1 virtual
  machine does not hang, but once you launch 8 virtual machines it
  hangs. Can we try increasing the number of virtual machines gradually
  and confirm that it only happens once a certain number of virtual
  machines is launched?

- Can you also paste the rules you have put on the important and
  non-important groups (example below)? Somehow I suspect that one of
  the rules has gone horribly bad, in the sense that it is very low and
  effectively no virtual machine is making any progress.

- How long does it take to reach this locked state where bo=0?

- You can also try piping the blktrace output through blkparse to
  standard output (command below) and capture some of it by copy-pasting
  the last messages.
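To be concrete about the scheduler test, something along these lines
should do it (just a sketch, assuming sdb is the end device):

  # show the available schedulers; the active one is in brackets
  cat /sys/block/sdb/queue/scheduler
  # switch the end device to deadline for the duration of the test
  echo deadline > /sys/block/sdb/queue/scheduler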
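For the rules, this is roughly the kind of thing I would expect to see
on the non-important groups (hypothetical group name and values; assumes
the blkio controller is mounted at /cgroup/blkio and that /dev/sdb is
major:minor 8:16 on your box):

  # ~10MB/s (10485760 bytes/s) write limit on /dev/sdb for one group
  echo "8:16 10485760" > /cgroup/blkio/notimportant/blkio.throttle.write_bps_device
  # read back what the kernel actually accepted
  cat /cgroup/blkio/notimportant/blkio.throttle.write_bps_device

If one of those values accidentally ended up orders of magnitude too
low, it would look very much like the stall you are seeing.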
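And for the live trace, the usual invocation is something like this
(again, adjust the device; "-o -" sends the trace to stdout):

  # stream events straight through blkparse so the last lines are still
  # visible on the terminal even if the box locks up right after
  blktrace -d /dev/sdb -o - | blkparse -i -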
In the meantime, I will try to launch more machines and see if I can
reproduce the issue.

Thanks
Vivek

> >
> > vmstat 2
> > procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
> >  r   b swpd      free  buff  cache si so bi   bo    in    cs us sy id wa
> >  0 296    0 125254224 21432 142016  0  0 16  633   181   331  0  0 93  7
> >  0 296    0 125253728 21432 142016  0  0  0    0 17115 33794  0  0 25 75
> >  0 296    0 125254112 21432 142016  0  0  0    0 17084 33721  0  0 25 74
> >  1 296    0 125254352 21440 142012  0  0  0   18 17047 33736  0  0 25 75
> >  0 296    0 125304224 21440 131060  0  0  0    0 17630 33989  0  1 23 76
> >  1 296    0 125306496 21440 130260  0  0  0    0 16810 33401  0  0 20 80
> >  4 296    0 125307208 21440 129856  0  0  0    0 17169 33744  0  0 26 74
> >  0 296    0 125307496 21448 129508  0  0  0   14 17105 33650  0  0 36 64
> >  0 296    0 125307712 21452 129672  0  0  2 1340 17117 33674  0  0 22 78
> >  1 296    0 125307752 21452 129520  0  0  0    0 16875 33438  0  0 29 70
> >  1 296    0 125307776 21452 129520  0  0  0    0 16959 33560  0  0 21 79
> >  1 296    0 125307792 21460 129520  0  0  0   12 16700 33239  0  0 15 85
> >  1 296    0 125307808 21460 129520  0  0  0    0 16750 33274  0  0 25 74
> >  1 296    0 125307808 21460 129520  0  0  0    0 17020 33601  0  0 26 74
> >  1 296    0 125308272 21460 129520  0  0  0    0 17080 33616  0  0 20 80
> >  1 296    0 125308408 21460 129520  0  0  0    0 16428 32972  0  0 42 58
> >  1 296    0 125308016 21460 129524  0  0  0    0 17021 33624  0  0 22 77
>
> While we're on that ... it is impossible for me now to recover from
> this state without pulling the power plug.
>
> On the VMs' consoles I see messages like
> INFO: task (kjournald|flush-254|dd|rs:main|...) blocked for more than
> 120 seconds.

If the VMs are completely blocked and not making any progress, that is
expected.

> While the ssh sessions through which the dd was started seem intact
> (pressing enter gives a new line), it is impossible to cancel the dd
> command. Logging in on a VM's console is also impossible.
>
> Opening a new ssh session to the host does not work either. Killing
> the qemu-kvm processes from a session opened earlier leaves zombie
> processes. Moving the VMs back to the root cgroup makes no difference
> either.
>
> Regards
> Dominik

--
libvir-list mailing list
libvir-list@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/libvir-list