RHEL6 host / CentOS5.6 guest periodically very sluggish

T Johnson <tjohnson46@xxxxxxxxx> · Sun, 8 May 2011 23:33:26 -0400

Hello,

I have a perplexing problem I'm hoping I might be able to get some
help with. This is a RHEL6 KVM host with 4 idle CentOS 5.6 guests.
Three guests have 1 vCPU and one guest has 4 vCPUs. The host is an
8-core/16-thread Nehalem-class machine and is idle other than the KVM
guests.

Roughly every 5-10 minutes both the host and the guests become very
sluggish to respond to input, and presumably to any running workload.
This lasts for maybe 4-5 minutes, then everything returns to normal
until it happens again 5-10 minutes later, and the cycle repeats
indefinitely. The host CPU is mostly idle during these sluggish
periods. I've noticed a big drop in interrupts on the host during
these periods, along with "missed X ticks" messages in dstat:

normal (responsive) dstat output on host:
----------------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   2  97   0   0   0|   0     0 |  53k  809k|   0     0 |  37k   37k
  0   1  99   0   0   0|   0    47k|  46B  346B|   0     0 |  37k   35k
  0   0 100   0   0   0|   0     0 | 394B 1038B|   0     0 |  30k   34k
  0   2  98   0   0   0|   0     0 |  46B  346B|   0     0 |  36k   34k
  1   3  96   0   0   0|   0     0 | 394B 1038B|   0     0 |  39k   34k
  1   2  98   0   0   0|   0     0 |  46B  346B|   0     0 |  37k   35k
  1   3  96   0   0   0|   0  1024B| 514B 1038B|   0     0 |  36k   39k
  1   1  98   0   0   0|   0     0 |  46B  346B|   0     0 |  35k   42k
  0   2  98   0   0   0|   0     0 | 394B 1038B|   0     0 |  38k   42k
  0   2  98   0   0   0|   0     0 |  46B  346B|   0     0 |  37k   42k
  1   2  97   0   0   0|   0     0 | 394B 1038B|   0     0 |  35k   41k
  1   1  98   0   0   0|   0     0 |  46B  346B|   0     0 |  31k   39k

example sluggish dstat output on host:
-------------------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0     0 |5902B   71k|   0     0 | 681  2657
  0   1  99   0   0   0|   0  1024B| 652B  692B|   0     0 |5387    41k missed 2 ticks
  0   1  99   0   0   0|   0     0 | 546B  788B|   0     0 |5741    43k missed 2 ticks
  0   1  99   0   0   0|   0     0 | 546B  756B|   0     0 |5770    43k missed 2 ticks
  1   1  98   0   0   0|   0  1024B| 184B  378B|   0     0 |8890    66k missed 2 ticks
  0   0  99   0   0   0|   0     0 |1062B 1166B|   0     0 |4631    34k missed 2 ticks
  0   2  98   0   0   0|   0     0 | 100B  378B|   0     0 |2680    24k
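
To cross-check the drop in the "int" column independently of dstat, the
interrupt rate can be sampled straight from /proc/stat. This is only a
rough sketch: I am assuming dstat derives that column from the "intr"
line, and the function names below are made up for illustration.

```python
def total_interrupts(proc_stat_text):
    """Return the cumulative interrupt count from the text of /proc/stat."""
    for line in proc_stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "intr":
            # The first number after "intr" is the total of all interrupts
            # serviced since boot; the rest are per-IRQ counters.
            return int(fields[1])
    raise ValueError("no intr line found")


def interrupts_per_second(before, after, interval):
    """Interrupt rate between two /proc/stat snapshots taken `interval` seconds apart."""
    delta = total_interrupts(after) - total_interrupts(before)
    return delta / float(interval)
```

Reading /proc/stat twice with a one-second sleep in between and passing
both snapshots to interrupts_per_second() should give a number comparable
to the dstat column, assuming the sleep itself is not stretched by the
stall.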

On the guests (which are idle) there is also some interesting dstat
output. During the sluggish periods, user, system, and soft-interrupt
(siq) CPU time increase sharply and the number of interrupts doubles
or triples. I can also usually count on dstat crashing on every guest
as soon as I notice the problem starting on the host:

Traceback (most recent call last):
  File "/usr/bin/dstat", line 1974, in ?
    main()
  File "/usr/bin/dstat", line 1919, in main
    o.extract()
  File "/usr/bin/dstat", line 509, in extract
    self.val[name][i] = 100.0 * (self.cn2[name][i] - self.cn1[name][i]) / (sum(self.cn2[name]) - sum(self.cn1[name]))
ZeroDivisionError: float division
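
If I read the traceback right, the denominator is the change in the
summed CPU counters between two consecutive samples, so the crash would
mean the guest accounted no CPU time at all across a sample interval.
Below is a minimal sketch of that calculation with a guard added. It is
not dstat's actual code; the function name and list arguments are
hypothetical stand-ins for cn1/cn2.

```python
def cpu_percentages(prev, curr):
    """Per-field CPU percentages from two samples of cumulative tick counters."""
    total_delta = sum(curr) - sum(prev)
    if total_delta == 0:
        # No ticks were accounted between the two samples (for example if
        # the guest clock stalled), so report zeros instead of dividing by 0.
        return [0.0] * len(curr)
    return [100.0 * (c - p) / total_delta for p, c in zip(prev, curr)]
```

With identical samples the unguarded expression divides by zero, which
matches the error above.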

example normal (responsive) dstat output on guest:
----------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1005    12
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1002    11
  1   0  99   0   0   0|   0     0 |  60B  314B|   0     0 |1003     9
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 | 106B  368B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    15
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1003     9
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1004    11
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |1003     9

example sluggish dstat output on guest:
--------------------
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 20  20  60   0   0   0|   0     0 |  60B  404B|   0     0 |1840     8
  0   0 100   0   0   0|   0     0 |  60B  314B|   0     0 |2341     8
 17   0  50   0   0  33|   0     0 |  60B  404B|   0     0 |3374     8
  0   0  33   0   0  67|   0     0 |  60B  314B|   0     0 |2002    11
  0   0   0   0   0 100|   0     0 |  60B  314B|   0     0 |1943     5
  0  50  50   0   0   0|   0    32k|  60B  420B|   0     0 | 922    18
 33   0  67   0   0   0|   0     0 |  60B  404B|   0     0 |1563     9

I'd guess some sort of timer/clock issue, but I'm unsure where to go
from here. Any help would be appreciated.
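
In case it helps with the timer theory, here is a small sketch I can run
on the host and each guest to report which clocksource the kernel is
using. It assumes the standard sysfs clocksource directory; the files may
not exist on older guest kernels, in which case the value comes back as
None. The function name is just for illustration.

```python
import os

# Standard sysfs location on kernels that use the generic clocksource code.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def read_clocksource(base=CLOCKSOURCE_DIR):
    """Return the current and available clocksources, or None where unreadable."""
    result = {}
    for name in ("current_clocksource", "available_clocksource"):
        path = os.path.join(base, name)
        try:
            with open(path) as f:
                result[name] = f.read().strip()
        except IOError:
            # File missing or unreadable on this kernel.
            result[name] = None
    return result


if __name__ == "__main__":
    for key, value in sorted(read_clocksource().items()):
        print("%s: %s" % (key, value))
```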

Thanks,
TJ