Unexpected period of iowait, no obvious activity?

Daniel Schneller <daniel.schneller@xxxxxxxxxxxxxxxx> · Fri, 19 Jun 2015 20:50:59 +0200

Hi!

Recently, over a period of a few hours, our 4 Ceph disk nodes showed
unusually high and fairly constant iowait times. The cluster runs Ceph
0.94.1 on Ubuntu 14.04.1.

It started on one node and then spread to the others one at a time,
roughly 15 minutes apart. The phenomenon lasted about 90 minutes on each
machine and ended in the same order in which it had started.

We could not see any obvious cluster activity during that time, and the
applications did not do anything out of the ordinary. Scrubbing and deep
scrubbing had been turned off long before this happened.

We are using CephFS for shared administrator home directories on the
system, RBD volumes for OpenStack, and the RADOS Gateway to manage
application data via the Swift interface. Telemetry and logs from inside
the VMs did not offer an explanation either.

The fact that these readings were limited to the OSD hosts and did not
appear on any of the other (client) nodes in the system suggests this
must be some kind of Ceph behaviour. Any ideas? We would like to
understand what the system was doing, but have not found anything
obvious in the logs.

Thanks!
Daniel

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com