On 02/23/2015 12:21 PM, Florian Haas wrote:
On Wed, Feb 18, 2015 at 9:19 PM, Florian Haas <florian@xxxxxxxxxxx> wrote:
Hey everyone,
I must confess I'm still not fully understanding this problem and
don't exactly know where to start digging deeper, but perhaps other
users have seen this and/or it rings a bell.
System info: Ceph Giant on CentOS 7; approx. 240 OSDs; 6 pools using 2
different rulesets. The problem affects hosts and PGs using a
bog-standard default crushmap.
Symptom: out of the blue, ceph-osd processes on a single OSD node
start going to 100% CPU utilization. The problem gets so bad that
the machine effectively becomes CPU-bound and can't cope with any
client requests anymore. Stopping and restarting all OSDs brings the
problem right back, as does rebooting the machine
problem right back, as does rebooting the machine — right after
ceph-osd processes start, CPU utilization shoots up again. Stopping
and marking out several OSDs on the machine makes the problem go away
but obviously causes massive backfilling. While CPU utilization is
implausibly high, all the logs show are slow requests (which would be
expected in a system that can barely do anything).
Now I've seen issues like this before on dumpling and firefly, but
besides the fact that they have all been addressed and should now be
fixed, they always involved the prior mass removal of RBD snapshots.
This system only used a handful of snapshots in testing, and is
presently not using any snapshots at all.
I'll be spending some time looking for clues in the log files of the
OSDs whose shutdown made the problem go away, but if
this sounds familiar to anyone willing to offer clues, I'd be more
than interested. :) Thanks!
Cheers,
Florian
Dan vd Ster was kind enough to pitch in an incredibly helpful off-list
reply, which I am taking the liberty to paraphrase here:
That "mysterious" OSD madness seems to be caused by NUMA zone reclaim,
which is enabled by default on Intel machines with recent kernels. It
can be disabled as follows:
echo 0 > /proc/sys/vm/zone_reclaim_mode
or of course, "sysctl -w vm.zone_reclaim_mode=0" or the corresponding
sysctl.conf entry.
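For reference, a minimal sketch of checking, disabling, and persisting the setting on CentOS 7 (the drop-in file name under /etc/sysctl.d/ is an illustrative choice; a plain /etc/sysctl.conf entry works just as well):

```shell
# Check the current mode; a non-zero value means zone reclaim is enabled
cat /proc/sys/vm/zone_reclaim_mode

# Disable zone reclaim at runtime (takes effect immediately, needs root)
sysctl -w vm.zone_reclaim_mode=0

# Persist the setting across reboots
echo 'vm.zone_reclaim_mode = 0' > /etc/sysctl.d/99-disable-zone-reclaim.conf
sysctl --system
```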
On the affected machines, that seems to have resolved the CPU-pegging
issue; at least it has not reappeared for several days now.
Dan and Sage have discussed the issue recently in this thread:
http://www.spinics.net/lists/ceph-users/msg14914.html
Thanks a million to Dan.
Ugh, this isn't the first time this has bitten us, and I think we've seen
it in other contexts as well and not realized it. Thank you both for
catching this.
Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com