Re: ceph-osd pegging CPU on giant, no snapshots involved this time

On Wed, Feb 18, 2015 at 9:19 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
> Hey everyone,
>
> I must confess I'm still not fully understanding this problem and
> don't exactly know where to start digging deeper, but perhaps other
> users have seen this and/or it rings a bell.
>
> System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2
> different rulesets. The problem applies to hosts and PGs using a
> bog-standard default crushmap.
>
> Symptom: out of the blue, ceph-osd processes on a single OSD node
> start going to 100% CPU utilization. The problem gets so bad that
> the machine effectively becomes CPU-bound and can't cope with any
> client requests anymore. Stopping and restarting all OSDs brings the
> problem right back, as does rebooting the machine: right after the
> ceph-osd processes start, CPU utilization shoots up again. Stopping
> and marking out several OSDs on the machine makes the problem go away,
> but obviously causes massive backfilling. The only thing the logs show
> while CPU utilization is implausibly high is slow requests (which
> would be expected in a system that can barely do anything).
>
> Now I've seen issues like this before on dumpling and firefly, but
> besides the fact that they have all been addressed and should now be
> fixed, they always involved the prior mass removal of RBD snapshots.
> This system only used a handful of snapshots in testing, and is
> presently not using any snapshots at all.
>
> I'll be spending some time looking for clues in the log files of the
> OSDs whose shutdown made the problem go away, but if
> this sounds familiar to anyone willing to offer clues, I'd be more
> than interested. :) Thanks!
>
> Cheers,
> Florian

Dan vd Ster was kind enough to pitch in with an incredibly helpful
off-list reply, which I am taking the liberty of paraphrasing here:

That "mysterious" OSD madness seems to be caused by NUMA zone reclaim,
which is enabled by default on Intel machines with recent kernels. It
can be disabled as follows:

echo 0 > /proc/sys/vm/zone_reclaim_mode

or, of course, "sysctl -w vm.zone_reclaim_mode=0" or the corresponding
sysctl.conf entry.
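
To make that persistent across reboots, something like the following
should work (just a sketch; the file name under /etc/sysctl.d is
arbitrary, and on older setups the line can go straight into
/etc/sysctl.conf instead):

# /etc/sysctl.d/99-zone-reclaim.conf
# Disable NUMA zone reclaim: let allocations fall back to remote
# NUMA nodes instead of aggressively reclaiming local pages.
vm.zone_reclaim_mode = 0

followed by "sysctl --system" (or a reboot) to apply it.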

On the affected machines, disabling zone reclaim seems to have removed
the CPU pegging issue; at least it has not reappeared for several days
now.
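
In case it helps anyone else diagnose the same thing, here is roughly
how to check whether a box is a candidate (assuming numactl is
installed; the vmstat counter below only exists on NUMA-enabled
kernels):

# non-zero output means zone reclaim is enabled
cat /proc/sys/vm/zone_reclaim_mode
# show whether the machine actually has multiple NUMA nodes
numactl --hardware
# a steadily growing zone_reclaim_failed counter is a bad sign
grep zone_reclaim /proc/vmstat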

Dan and Sage have discussed the issue recently in this thread:
http://www.spinics.net/lists/ceph-users/msg14914.html

Thanks a million to Dan.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




