Hello,

We have an OpenStack installation running on Linux:

    host kernel: 3.13.0-35-generic
    qemu: QEMU emulator version 2.2.0 (Debian 1:2.2+dfsg-5expubuntu9.7~cloud5)

The host OS uses its own physical disk, while the guests use Ceph for all their storage needs. We have observed situations where the guests seem to hang/block for a few seconds, and we cannot figure out why.

I have created a simple script that records some parameters every second and writes them to a file every 10 minutes:

COUNT=600

date >> $LOGFILE
cut -f 1 -d' ' /proc/uptime >> $LOGFILE

while :
do
    for ((i=0; i<$COUNT; i++))
    do
        read UPT[$i] rest < /proc/uptime
        read cpu CPUAVG[$i] < /proc/stat
        /bin/sleep 1
    done
    for ((i=0; i<$COUNT; i++))
    do
        echo ${UPT[$i]}" "${CPUAVG[$i]} >> $LOGFILE
    done
done

With this script I caught situations where more than 2 seconds elapse between two consecutive records. This can be seen in the following series of data (the first column is the uptime, the rest are the counters from /proc/stat):

235503.52 21198 283 30839 23487544 3556 0 0 461168 0 0
235504.52 21198 283 30839 23487644 3556 0 0 461168 0 0
235505.53 21198 283 30839 23487744 3556 0 0 461168 0 0
235508.60 21199 283 30839 23487844 3556 0 0 461168 0 0   <--- 3 sec
235509.60 21200 283 30839 23487944 3556 0 0 461168 0 0
235510.61 21200 283 30839 23488044 3556 0 0 461168 0 0
235511.61 21200 283 30840 23488144 3556 0 0 461168 0 0
235513.98 21200 283 30840 23488244 3556 0 0 461168 0 0
235519.85 21201 283 30840 23488342 3558 0 0 461168 0 0   <--- 6 sec
235522.44 21201 283 30841 23488442 3558 0 0 461168 0 0   <--- 3 sec
235525.06 21202 283 30841 23488542 3558 0 0 461168 0 0   <--- 3 sec
235574.25 21212 283 30852 23493333 3561 0 0 461168 0 0
235577.44 21213 283 30852 23493433 3561 0 0 461168 0 0   <--- 3 sec
236616.81 21379 283 31108 23596808 3565 0 0 461168 0 0
236625.24 21380 283 31108 23596908 3565 0 0 461168 0 0   <--- 9 sec
236627.76 21380 283 31108 23597008 3565 0 0 461168 0 0
236630.02 21381 283 31108 23597066 3565 0 0 461168 0 0   <--- 3 sec
236632.54 21381 283 31108 23597101 3565 0 0 461168 0 0
236636.13 21382 283 31108 23597201 3565 0 0 461168 0 0   <--- 4 sec
237219.96 21476 283 31255 23655224 3573 0 0 461168 0 0
237222.56 21477 283 31255 23655324 3573 0 0 461168 0 0   <--- 3 sec
237225.55 21478 283 31255 23655424 3573 0 0 461168 0 0   <--- 3 sec

All of these seem to come in groups that rarely span more than a minute: the first 5 issues occur within 72 seconds (starting at 235505.53, ending at 235577.44), the next 3 within 20 seconds (236616.81 - 236636.13), and the last group spans 6 seconds (237219.96 - 237225.55).
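Instead of eyeballing the file, the gaps can also be flagged with a small awk sketch along these lines (assuming the log format produced by the script above, with the uptime in the first column and a threshold of 2 seconds):

awk '$1 ~ /^[0-9]+\.[0-9]+$/ {
         # first field looks like an uptime value; compare it with the previous record
         if (prev != "" && $1 - prev > 2)
             printf "%.2f s gap before uptime %s\n", $1 - prev, $1
         prev = $1
     }' "$LOGFILE"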
If I check the status of the host as recorded by sar for the same period, I get:

~$ sar -f /var/log/sysstat/sa05 -s 02:00:00 -e 04:00:00
Linux 3.13.0-35-generic (dp-c2-bl03)   03/05/2017   _x86_64_   (40 CPU)

02:05:01 AM   CPU   %user   %nice   %system   %iowait   %steal   %idle
02:15:01 AM   all    0.99    0.00      0.38      0.00     0.00   98.64
02:25:01 AM   all    1.00    0.00      0.38      0.00     0.00   98.62
02:35:01 AM   all    1.21    0.00      0.51      0.00     0.00   98.28
02:45:01 AM   all    1.01    0.00      0.38      0.00     0.00   98.61
02:55:01 AM   all    1.17    0.00      0.50      1.00     0.00   97.33   <--- iowait > 0 (this is when the first group happened)
03:05:01 AM   all    1.02    0.00      0.37      0.00     0.00   98.61
03:15:01 AM   all    1.22    0.00      0.55      0.81     0.00   97.42   <--- iowait > 0 (2nd group)
03:25:01 AM   all    1.04    0.00      0.45      0.38     0.00   98.13   <--- iowait > 0 (3rd group)
03:35:01 AM   all    1.04    0.00      0.41      0.00     0.00   98.55
03:45:01 AM   all    1.10    0.00      0.43      0.00     0.00   98.46
03:55:01 AM   all    1.00    0.00      0.38      0.00     0.00   98.62
Average:      all    1.07    0.00      0.43      0.20     0.00   98.30

All the other parameters recorded by sar seem normal. No load is seen on the disk:

~$ sar -f /var/log/sysstat/sa05 -s 02:00:00 -e 04:00:00 -d
Linux 3.13.0-35-generic (dp-c2-bl03)   03/05/2017   _x86_64_   (40 CPU)

02:05:01 AM      DEV    tps   rd_sec/s   wr_sec/s   avgrq-sz   avgqu-sz   await   svctm   %util
02:15:01 AM   dev8-0   0.76       0.00      26.01      34.30       0.00    0.01    0.01    0.00
02:25:01 AM   dev8-0   0.81       0.00      26.42      32.63       0.00    0.00    0.00    0.00
02:35:01 AM   dev8-0   0.76       0.00      25.93      34.20       0.00    0.00    0.00    0.00
02:45:01 AM   dev8-0   0.76       0.00      25.97      34.33       0.00    0.00    0.00    0.00
02:55:01 AM   dev8-0   0.75       0.00      25.91      34.47       0.00    0.01    0.01    0.00
03:05:01 AM   dev8-0   0.80       0.00      26.98      33.54       0.00    0.01    0.01    0.00
03:15:01 AM   dev8-0   0.74       0.00      24.92      33.68       0.00    0.02    0.02    0.00
03:25:01 AM   dev8-0   0.74       0.00      25.99      35.20       0.00    0.00    0.00    0.00
03:35:01 AM   dev8-0   0.77       0.00      26.57      34.29       0.00    0.02    0.02    0.00
03:45:01 AM   dev8-0   0.79       0.00      25.89      32.64       0.00    0.01    0.01    0.00
03:55:01 AM   dev8-0   0.77       0.00      26.03      33.66       0.00    0.00    0.00    0.00
Average:      dev8-0   0.77       0.00      26.06      33.88       0.00    0.01    0.01    0.00

~$ sar -f /var/log/sysstat/sa05 -s 02:00:00 -e 04:00:00 -S
Linux 3.13.0-35-generic (dp-c2-bl03)   03/05/2017   _x86_64_   (40 CPU)

02:05:01 AM   kbswpfree   kbswpused   %swpused   kbswpcad   %swpcad
02:15:01 AM     8388604           0       0.00          0      0.00
02:25:01 AM     8388604           0       0.00          0      0.00
02:35:01 AM     8388604           0       0.00          0      0.00
02:45:01 AM     8388604           0       0.00          0      0.00
02:55:01 AM     8388604           0       0.00          0      0.00
03:05:01 AM     8388604           0       0.00          0      0.00
03:15:01 AM     8388604           0       0.00          0      0.00
03:25:01 AM     8388604           0       0.00          0      0.00
03:35:01 AM     8388604           0       0.00          0      0.00
03:45:01 AM     8388604           0       0.00          0      0.00
03:55:01 AM     8388604           0       0.00          0      0.00
Average:        8388604           0       0.00          0      0.00

Has any of you seen something similar? What can cause the rise in iowait? (This is the only metric that looks strange in the given periods.) What should I do to track down the reason for these hangs? What can be done to avoid them? Any ideas are welcome.

Thank you,
Laszlo
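P.S. To line up a gap in the log with the host's sar intervals, something like the following sketch could help; uptime_to_wallclock is just an illustrative name, it has to be run inside the guest, and it assumes the guest has not rebooted since the record was written:

uptime_to_wallclock() {
    # Convert a recorded /proc/uptime value (seconds since guest boot) to
    # wall-clock time: boot time = now - current uptime, then add the record.
    local recorded=$1
    local now_epoch now_uptime
    now_epoch=$(date +%s)
    now_uptime=$(cut -f 1 -d' ' /proc/uptime)
    date -d "@$(awk -v n="$now_epoch" -v u="$now_uptime" -v r="$recorded" \
                    'BEGIN { printf "%d", n - u + r }')"
}

# Example: the 6-second gap above ended at uptime 235519.85
uptime_to_wallclock 235519.85

The resulting timestamp can then be matched against the sar interval in which iowait rose (e.g. the 02:55:01 sample for the first group).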