Re: Unexpected period of iowait, no obvious activity?

On Fri, Jun 19, 2015 at 7:50 PM, Daniel Schneller
<daniel.schneller@xxxxxxxxxxxxxxxx> wrote:
> Hi!
>
> Recently, over a few hours, our four Ceph disk nodes showed unusually
> high and fairly constant iowait times. The cluster runs 0.94.1 on
> Ubuntu 14.04.1.
>
> It started on one node, then spread - with maybe a 15-minute delay
> each time - to the next and the next. The phenomenon lasted about 90
> minutes on each machine, ending in the same order it had started.
>
> We could not see any obvious cluster activity during that time,
> applications did not do anything out of the ordinary. Scrubbing and deep
> scrubbing were turned off long before this happened.
>
> We are using CephFS for shared administrator home directories on the
> system, RBD volumes for OpenStack and the Rados Gateway to manage
> application data via the Swift interface. Telemetry and logs from inside
> the VMs did not offer an explanation either.
>
> The fact that these readings were limited to the OSD hosts and did
> not appear on any of the other (client) nodes suggests this must be
> some kind of Ceph behaviour. Any ideas? We would like to understand
> what the system was doing, but haven't found anything obvious in the
> logs.

Not necessarily. If the load was truly staggered from node to node,
it's unlikely Ceph did that: everything Ceph does that changes
read/write rates does it across multiple nodes at once. Writes from
clients are spread across the cluster; reads only go to primaries, but
primaries are not so strongly segregated by machine; beyond that
there's nothing but scrubbing and recovery...
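
(As an illustration of that point: you can ask the cluster where any
given object's primary lives. The pool and object names here are just
placeholders:

    ceph osd map rbd some-object

This prints the PG the object maps to plus the up/acting OSD sets,
with the primary marked; sample a few objects and you should see the
primaries scatter across all the OSD hosts rather than clustering on
one machine.)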

On the other hand, there are lots of administrative tasks that run on
a schedule and can behave like this. The CERN folks had a lot of
trouble with a daemon, installed by their standard Ubuntu deployment,
that scanned each OSD's entire store to track changes. You probably
want to look for something like that, and try to figure out which
processes are doing all the IO while things are slow.
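
If it recurs, something along these lines should catch the culprit (a
rough sketch, assuming iotop and sysstat are installed; tune the
sample interval and iteration count to taste):

    # per-process IO, batch mode, accumulated totals, only processes
    # actually doing IO; one sample every 5s for an hour
    iotop -b -o -a -d 5 -n 720 > /var/log/iotop-capture.log

    # alternatively, per-process disk stats from sysstat every 5s
    pidstat -d 5 > /var/log/pidstat-capture.log

A look through /etc/cron.daily/ is also worthwhile - filesystem
scanners like mlocate's updatedb, which walk the entire disk, are a
classic source of unexplained iowait.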
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



