On Fri, 05 Sep 2014 16:23:13 +0200 Josef Johansson wrote:

> Hi,
>
> How do you guys monitor the cluster to find disks that behave bad, or
> VMs that impact the Ceph cluster?
>
> I'm looking for something where I could get a good bird-view of
> latency/throughput, that uses something easy like SNMP.
>

You mean there is another form of monitoring than waiting for the
users/customers to yell at you because performance sucks? ^o^

The first part is relatively easy: run something like "iostat -y -x 300"
and feed the output into snmp via the extend functionality. Maybe
somebody has done that already, but it would be trivial anyway.

The hard part here is what to do with that data. Just graphing it is
great for post-mortem analysis or if you have 24h staff staring blindly
at monitors. Deciding what numbers warrant a warning or even a
notification (in Nagios terms) is going to be much harder.

Take this iostat -x output (all activity since boot) for example:

Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sda          0.00     0.30     0.02     2.61     0.43    405.49   308.65     0.03    10.44     0.50    10.52     0.83     0.22
sdb          0.00     0.30     0.01     2.55     0.27    379.16   296.20     0.03    11.31     0.73    11.35     0.80     0.21
sdc          0.00     0.29     0.02     2.44     0.38    376.57   307.23     0.03    11.82     0.56    11.89     0.84     0.21
sdd          0.00     0.29     0.01     2.42     0.24    369.05   304.43     0.03    11.51     0.63    11.55     0.84     0.20
sde          0.02   266.52     0.65     2.93    72.56    365.03   244.67     0.29    79.75     1.65    97.16     1.60     0.57
sdg          0.01     0.97     0.72     0.65    76.33    187.84   384.75     0.09    69.06     1.85   143.21     2.87     0.39
sdf          0.01     0.87     0.68     0.59    67.04    167.94   369.82     0.09    67.58     2.79   143.18     3.44     0.44
sdh          0.00     0.94     0.94     0.64    74.87    182.81   327.19     0.09    57.34     1.91   139.22     2.79     0.44
sdj          0.01     0.96     0.93     0.65    75.76    187.75   331.78     0.10    62.76     1.81   149.88     2.72     0.43
sdk          0.01     1.02     1.00     0.67    77.78    188.83   320.46     0.08    47.02     1.66   115.02     2.53     0.42
sdi          0.01     0.93     0.96     0.61    74.38    173.72   317.35     0.22   140.56     2.16   358.85     3.49     0.54
sdl          0.01     0.92     0.71     0.62    72.57    175.19   373.05     0.09    65.36     2.01   138.19     3.03     0.40

sda to sdd are SSDs, so for starters you can't compare them with
spinning rust. So if you were to look for outliers, all of sde to sdl
(actual disks) are suspiciously slow. ^o^

And if you look at sde it seems to be faster than the rest, but that is
because the original drive was replaced and thus the new one has seen
less action than the rest. The actual wonky drive is sdi, looking at
await/w_await and svctm. This drive sometimes goes into a state (for
10-20 hours at a time) where it can only perform at half speed.

These are the same drives when running a rados bench against the
cluster; sdi is currently not wonky and is performing at full speed:

Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sde          0.00   173.00     0.00   236.60     0.00  91407.20   772.67    76.40   338.32     0.00   338.32     4.21    99.60
sdg          0.00   153.00     0.40   234.60     1.60  88052.40   749.40    83.61   359.95    23.00   360.52     4.24    99.68
sdf          0.00   147.30     0.50   206.00     2.00  68918.40   667.51    50.15   264.40    65.60   264.88     4.45    91.88
sdh          0.00   158.10     0.80   170.90     3.20  66077.20   769.72    31.31   153.45    12.50   154.11     5.40    92.76
sdj          0.00   158.00     0.60   207.00     2.40  77455.20   746.22    61.61   296.78    55.33   297.48     4.79    99.52
sdk          0.00   160.90     0.90   242.30     3.60  92251.20   758.67    57.11   234.84    40.44   235.57     4.06    98.68
sdi          0.00   166.70     1.00   190.90     4.00  69919.20   728.75    60.15   282.98    24.00   284.34     5.16    99.00
sdl          0.00   131.90     0.80   207.10     3.20  85014.00   817.87    92.10   412.02    53.00   413.41     4.79    99.52

Now things are more uniform (of course Ceph never is really uniform,
and sdh was more busy and thus slower in the next sample).
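Just to illustrate the kind of check I have in mind, here is a rough
sketch (untested; it assumes GNU awk, that sde to sdl are your identical
spinners, and the 2x-median and absolute-floor thresholds are pure
guesswork to be tuned to your own hardware). Run over a single 5 minute
sample it flags any spinner whose w_await sticks out from its peers:

iostat -y -x 300 1 | awk '
    # look only at the identical spinners, sde to sdl in this setup
    $1 ~ /^sd[e-l]$/ {
        dev[++n] = $1
        wa[n]    = $12 + 0  # w_await; column position depends on sysstat version
    }
    END {
        if (n == 0)
            exit
        # crude median of the peer group (asort is GNU awk only)
        m   = asort(wa, s)
        med = s[int((m + 1) / 2)]
        for (i = 1; i <= n; i++)
            # flag anything well above its peers, with an absolute floor
            # so a mostly idle cluster does not trigger on noise
            if (wa[i] > 2 * med && wa[i] > 50)
                printf "WARNING: %s w_await %.1f vs. peer median %.1f\n",
                       dev[i], wa[i], med
    }'

The same numbers could of course come from whatever already feeds snmpd
via extend; the pipe above is just the shortest way to show the idea,
and with 300 second samples it is something to run from cron rather than
as a live check.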
If sdi were in its half speed mode, it would be at 100% (all the time,
while the other drives were not and often even idle), with a svctm of
about 15 and a w_await well over 800.

You could simply say that with this baseline anything that goes over
500 w_await is worthy of an alert, but it might only get there if your
cluster is sufficiently busy. To really find a "slow" disk, you need to
compare identical disks having the same workload. Personally I'm still
not sure what formula to use, even though it is so blatantly obvious
and visible when you look at the data.

You probably want to monitor your storage nodes for something as simple
as load and, based on your testing and experience, set
warning/notification levels accordingly. This will give you a heads-up
so you can start investigating things in detail.

Now bad VMs, that's far harder, at least from where I'm standing.

If you were to use the kernel space RBD to access images, it would be
visible as disk I/O of that qemu process (assuming KVM here). Trivial
to get that data and correlate it with a VM.

With the far more common (and faster) user space librbd access, the
host OS no longer sees that activity as disk I/O. QMP and thus libvirt
(virt-top) might get it right, but I haven't investigated QMP yet and
don't use libvirt here.

Thus iostat-level statistics per RBD image, provided by Ceph, would be
really nice[TM] in my book.

Regards,

Christian

> Regards,
> Josef Johansson
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/