Hello,

I have a perplexing problem I'm hoping to get some help with.

This is a RHEL6 KVM host with 4 idle CentOS 5.6 guests. 3 guests have 1 vCPU, 1 guest has 4 vCPUs. The host is an 8 core/16 thread Nehalem-class machine and is idle other than the KVM guests.

Roughly every 5-10 minutes, both the host and the guests become very sluggish to respond to input, and likely to any running workload. This lasts for maybe 4-5 minutes, then everything returns to normal until it happens again 5-10 minutes later. This repeats indefinitely.

The host CPU is mostly idle during these sluggish periods. I've noticed a big drop in interrupts on the host during these periods, along with "missed X ticks" messages in dstat:

normal (responsive) dstat output on host:
----------------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   2  97   0   0   0|   0     0 |  53k  809k|   0     0 |  37k   37k
  0   1  99   0   0   0|   0    47k|  46B  346B|   0     0 |  37k   35k
  0   0 100   0   0   0|   0     0 | 394B 1038B|   0     0 |  30k   34k
  0   2  98   0   0   0|   0     0 |  46B  346B|   0     0 |  36k   34k
  1   3  96   0   0   0|   0     0 | 394B 1038B|   0     0 |  39k   34k
  1   2  98   0   0   0|   0     0 |  46B  346B|   0     0 |  37k   35k
  1   3  96   0   0   0|   0  1024B| 514B 1038B|   0     0 |  36k   39k
  1   1  98   0   0   0|   0     0 |  46B  346B|   0     0 |  35k   42k
  0   2  98   0   0   0|   0     0 | 394B 1038B|   0     0 |  38k   42k
  0   2  98   0   0   0|   0     0 |  46B  346B|   0     0 |  37k   42k
  1   2  97   0   0   0|   0     0 | 394B 1038B|   0     0 |  35k   41k
  1   1  98   0   0   0|   0     0 |  46B  346B|   0     0 |  31k   39k

example sluggish dstat output on host:
-------------------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0     0 |5902B   71k|   0     0 | 681  2657
  0   1  99   0   0   0|   0  1024B| 652B  692B|   0     0 |5387    41k missed 2 ticks
  0   1  99   0   0   0|   0     0 | 546B  788B|   0     0 |5741    43k missed 2 ticks
  0   1  99   0   0   0|   0     0 | 546B  756B|   0     0 |5770    43k missed 2 ticks
  1   1  98   0   0   0|   0  1024B| 184B  378B|   0     0 |8890    66k missed 2 ticks
  0   0  99   0   0   0|   0     0 |1062B 1166B|   0     0 |4631    34k missed 2 ticks
  0   2  98   0   0   0|   0     0 | 100B  378B|   0     0 |2680    24k

On the guests (which are idle) there is also some interesting dstat output.
During the sluggish periods, user, system, and interrupt CPU usage increase sharply and the number of interrupts doubles or triples. I can also usually count on dstat crashing on every guest as soon as I notice the problem starting on the host:

Traceback (most recent call last):
  File "/usr/bin/dstat", line 1974, in ?
    main()
  File "/usr/bin/dstat", line 1919, in main
    o.extract()
  File "/usr/bin/dstat", line 509, in extract
    self.val[name][i] = 100.0 * (self.cn2[name][i] - self.cn1[name][i]) / (sum(self.cn2[name]) - sum(self.cn1[name]))
ZeroDivisionError: float division

example normal (responsive) dstat output on guest:
----------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1005    12
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1002    11
  1   0  99   0   0   0|   0     0 |  60B  314B|   0     0 |1003     9
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 | 106B  368B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    15
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1003     9
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1003     9

example sluggish dstat output on guest:
--------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 20  20  60   0   0   0|   0     0 |  60B  404B|   0     0 |1840     8
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |2341     8
 17   0  50   0   0  33|   0     0 |  60B  404B|   0     0 |3374     8
  0   0  33   0   0  67|   0     0 |  60B  314B|   0     0 |2002    11
  0   0   0   0   0 100|   0     0 |  60B  314B|   0     0 |1943     5
  0  50  50   0   0   0|   0    32k|  60B  420B|   0     0 | 922    18
 33   0  67   0   0   0|   0     0 |  60B  404B|   0     0 |1563     9

I'd guess some sort of timer/clock issue, but I'm unsure where to go from here.

Any help would be appreciated.

Thanks,
TJ
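
A note on the dstat traceback above: the failing expression divides each counter's delta by the delta of the summed counters, so it can only raise ZeroDivisionError when the summed counters are identical across two samples. One possible reading (an assumption, not something confirmed here) is that the guest's CPU tick counters did not advance at all between two reads. The sketch below reproduces that arithmetic outside dstat; the function names and sample values are hypothetical and are not part of dstat itself.

```python
def cpu_percentages_unguarded(prev, curr):
    """Same arithmetic as the expression in the traceback (names are hypothetical)."""
    total_delta = sum(curr) - sum(prev)
    return [100.0 * (c - p) / total_delta for p, c in zip(prev, curr)]


def cpu_percentages(prev, curr):
    """Guarded variant: return None when no ticks were accounted between samples."""
    total_delta = sum(curr) - sum(prev)
    if total_delta == 0:
        # Counters did not move between the two samples, so there is nothing to apportion.
        return None
    return [100.0 * (c - p) / total_delta for p, c in zip(prev, curr)]


if __name__ == "__main__":
    # Made-up tick counters (user, system, idle), not taken from the hosts above.
    before = [100, 50, 850]
    after = [110, 60, 930]
    print(cpu_percentages(before, after))

    # Identical samples: the guarded version returns None,
    # the unguarded version raises ZeroDivisionError.
    print(cpu_percentages(after, after))
    try:
        cpu_percentages_unguarded(after, after)
    except ZeroDivisionError as exc:
        print("unguarded version failed:", exc)
```

If that reading is right, the crash is a symptom of guest time accounting stalling rather than a separate dstat problem.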