On Fri, 05 Sep 2014 16:23:13 +0200 Josef Johansson wrote:

> Hi,
>
> How do you guys monitor the cluster to find disks that behave bad, or
> VMs that impact the Ceph cluster?
>
> I'm looking for something where I could get a good bird-view of
> latency/throughput, that uses something easy like SNMP.
>

You mean there is another form of monitoring than waiting for the
users/customers to yell at you because performance sucks? ^o^

The first part is relatively easy: run something like "iostat -y -x 300"
and feed the output into snmp via the extend functionality. Maybe
somebody has done that already, but it would be trivial anyway.

The hard part here is what to do with that data. Just graphing it is
great for post-mortem analysis or if you have 24h staff staring blindly
at monitors. Deciding what numbers warrant a warning or even a
notification (in Nagios terms) is going to be much harder.

Take this iostat -x output (all activity since boot) for example:

Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sda          0.00     0.30     0.02     2.61     0.43    405.49   308.65     0.03    10.44     0.50    10.52     0.83     0.22
sdb          0.00     0.30     0.01     2.55     0.27    379.16   296.20     0.03    11.31     0.73    11.35     0.80     0.21
sdc          0.00     0.29     0.02     2.44     0.38    376.57   307.23     0.03    11.82     0.56    11.89     0.84     0.21
sdd          0.00     0.29     0.01     2.42     0.24    369.05   304.43     0.03    11.51     0.63    11.55     0.84     0.20
sde          0.02   266.52     0.65     2.93    72.56    365.03   244.67     0.29    79.75     1.65    97.16     1.60     0.57
sdg          0.01     0.97     0.72     0.65    76.33    187.84   384.75     0.09    69.06     1.85   143.21     2.87     0.39
sdf          0.01     0.87     0.68     0.59    67.04    167.94   369.82     0.09    67.58     2.79   143.18     3.44     0.44
sdh          0.00     0.94     0.94     0.64    74.87    182.81   327.19     0.09    57.34     1.91   139.22     2.79     0.44
sdj          0.01     0.96     0.93     0.65    75.76    187.75   331.78     0.10    62.76     1.81   149.88     2.72     0.43
sdk          0.01     1.02     1.00     0.67    77.78    188.83   320.46     0.08    47.02     1.66   115.02     2.53     0.42
sdi          0.01     0.93     0.96     0.61    74.38    173.72   317.35     0.22   140.56     2.16   358.85     3.49     0.54
sdl          0.01     0.92     0.71     0.62    72.57    175.19   373.05     0.09    65.36     2.01   138.19     3.03     0.40

sda to sdd are SSDs, so for starters you can't compare them with
spinning rust. So if you were to look for outliers, all of sde to sdl
(actual disks) are suspiciously slow. ^o^

And if you look at sde it seems to be faster than the rest, but that is
because the original drive was replaced and thus the new one has seen
less action than the rest. The actual wonky drive is sdi, looking at
await/w_await and svctm. This drive sometimes goes into a state (for
10-20 hours at a time) where it can only perform at half speed.

These are the same drives when running a rados bench against the
cluster; sdi is currently not wonky and is performing at full speed:

Device:    rrqm/s   wrqm/s      r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz    await  r_await  w_await    svctm    %util
sde          0.00   173.00     0.00   236.60     0.00  91407.20   772.67    76.40   338.32     0.00   338.32     4.21    99.60
sdg          0.00   153.00     0.40   234.60     1.60  88052.40   749.40    83.61   359.95    23.00   360.52     4.24    99.68
sdf          0.00   147.30     0.50   206.00     2.00  68918.40   667.51    50.15   264.40    65.60   264.88     4.45    91.88
sdh          0.00   158.10     0.80   170.90     3.20  66077.20   769.72    31.31   153.45    12.50   154.11     5.40    92.76
sdj          0.00   158.00     0.60   207.00     2.40  77455.20   746.22    61.61   296.78    55.33   297.48     4.79    99.52
sdk          0.00   160.90     0.90   242.30     3.60  92251.20   758.67    57.11   234.84    40.44   235.57     4.06    98.68
sdi          0.00   166.70     1.00   190.90     4.00  69919.20   728.75    60.15   282.98    24.00   284.34     5.16    99.00
sdl          0.00   131.90     0.80   207.10     3.20  85014.00   817.87    92.10   412.02    53.00   413.41     4.79    99.52

Now things are more uniform (of course Ceph never is really uniform,
and sdh was more busy and thus slower in the next sample).
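Just to illustrate the kind of check I have in mind, here is a rough
sketch (untested; it assumes GNU awk, that sde to sdl are your identical
spinners, and the 2x-median and absolute-floor thresholds are pure
guesswork to be tuned to your own hardware). Run over a single 5 minute
sample it flags any spinner whose w_await sticks out from its peers:

iostat -y -x 300 1 | awk '
    # look only at the identical spinners, sde to sdl in this setup
    $1 ~ /^sd[e-l]$/ {
        dev[++n] = $1
        wa[n]    = $12 + 0  # w_await; column position depends on sysstat version
    }
    END {
        if (n == 0)
            exit
        # crude median of the peer group (asort is GNU awk only)
        m   = asort(wa, s)
        med = s[int((m + 1) / 2)]
        for (i = 1; i <= n; i++)
            # flag anything well above its peers, with an absolute floor
            # so a mostly idle cluster does not trigger on noise
            if (wa[i] > 2 * med && wa[i] > 50)
                printf "WARNING: %s w_await %.1f vs. peer median %.1f\n",
                       dev[i], wa[i], med
    }'

The same numbers could of course come from whatever already feeds snmpd
via extend; the pipe above is just the shortest way to show the idea,
and with 300 second samples it is something to run from cron rather than
as a live check.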
If sdi were in its half speed mode, it would be at 100% (all the time,
while the other drives were not and often even idle), with a svctm of
about 15 and a w_await well over 800.

You could simply say that with this baseline anything that goes over
500 w_await is worthy of an alert, but it might only get there if your
cluster is sufficiently busy. To really find a "slow" disk, you need to
compare identical disks having the same workload. Personally I'm still
not sure what formula to use, even though it is so blatantly obvious
and visible when you look at the data.

You probably want to monitor your storage nodes for something as simple
as load and, based on your testing and experience, set
warning/notification levels accordingly. This will give you a heads-up
so you can start investigating things in detail.

Now bad VMs, that's far harder, at least from where I'm standing.

If you were to use the kernel space RBD to access images, it would be
visible as disk I/O of that qemu process (assuming KVM here). Trivial
to get that data and correlate it with a VM.

With the far more common (and faster) user space librbd access, the
host OS no longer sees that activity as disk I/O. QMP and thus libvirt
(virt-top) might get it right, but I haven't investigated QMP yet and
don't use libvirt here.

Thus iostat-level statistics per RBD image, provided by Ceph, would be
really nice[TM] in my book.

Regards,

Christian

> Regards,
> Josef Johansson
> _______________________________________________
> ceph-users mailing list
> ceph-users at lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 
Christian Balzer        Network/Systems Engineer
chibi at gol.com         Global OnLine Japan/Fusion Communications
http://www.gol.com/