On 02/18/2015 02:19 PM, Florian Haas wrote:
Hey everyone,

I must confess I'm still not fully understanding this problem and don't exactly know where to start digging deeper, but perhaps other users have seen this and/or it rings a bell.

System info: Ceph giant on CentOS 7; approx. 240 OSDs, 6 pools using 2 different rulesets; the problem applies to hosts and PGs using a bog-standard default crushmap.

Symptom: out of the blue, ceph-osd processes on a single OSD node start going to 100% CPU utilization. The problem gets so bad that the machine effectively becomes CPU bound and can't cope with any client requests anymore. Stopping and restarting all OSDs brings the problem right back, as does rebooting the machine: right after the ceph-osd processes start, CPU utilization shoots up again. Stopping and marking out several OSDs on the machine makes the problem go away, but obviously causes massive backfilling. While CPU utilization is implausibly high, the only thing the logs show is slow requests (which would be expected in a system that can barely do anything).

Now I've seen issues like this before on dumpling and firefly, but besides the fact that they have all been addressed and should now be fixed, they always involved the prior mass removal of RBD snapshots. This system only used a handful of snapshots in testing, and is presently not using any snapshots at all.

I'll be spending some time looking for clues in the log files of the OSDs whose shutdown made the problem go away, but if this sounds familiar to anyone willing to offer clues, I'd be more than interested. :)

Thanks!
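For reference, a quick way to see which ceph-osd threads are actually burning the CPU is to sample per-thread CPU time from /proc. The script below is only a rough sketch, not something from the original post: it assumes a Linux /proc layout and nothing beyond the Python standard library, and the thread names it prints vary between Ceph releases.

#!/usr/bin/env python
# Sample per-thread CPU time of every ceph-osd process twice, then
# print the threads that accumulated the most CPU in between.
import os
import time

def osd_pids():
    """Yield PIDs of processes whose comm is ceph-osd."""
    for entry in os.listdir('/proc'):
        if not entry.isdigit():
            continue
        try:
            with open('/proc/%s/comm' % entry) as f:
                if f.read().strip() == 'ceph-osd':
                    yield int(entry)
        except IOError:
            pass  # process exited between listdir() and open()

def thread_cpu(pid):
    """Return {(tid, thread name): utime+stime in clock ticks}."""
    ticks = {}
    for tid in os.listdir('/proc/%d/task' % pid):
        try:
            with open('/proc/%d/task/%s/stat' % (pid, tid)) as f:
                # Split after the ')' closing the comm field; utime
                # and stime then sit at indexes 11 and 12.
                fields = f.read().rsplit(')', 1)[1].split()
            with open('/proc/%d/task/%s/comm' % (pid, tid)) as f:
                name = f.read().strip()
            ticks[(int(tid), name)] = int(fields[11]) + int(fields[12])
        except (IOError, IndexError):
            pass  # thread exited mid-sample
    return ticks

def main(interval=5.0):
    before = dict((pid, thread_cpu(pid)) for pid in osd_pids())
    time.sleep(interval)
    hz = os.sysconf('SC_CLK_TCK')
    deltas = []
    for pid, old in before.items():
        for key, now in thread_cpu(pid).items():
            deltas.append((now - old.get(key, now), pid, key))
    for delta, pid, (tid, name) in sorted(deltas, reverse=True)[:15]:
        print('%6.1f%% pid=%d tid=%d %s'
              % (100.0 * delta / hz / interval, pid, tid, name))

if __name__ == '__main__':
    main()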
Hi Florian,

Does a quick perf top tell you anything useful?
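Besides perf top, it may be worth checking whether specific requests are wedged rather than the whole daemon just spinning. Below is a minimal sketch along those lines; it assumes the default cluster name and admin socket path (/var/run/ceph/ceph-osd.*.asok) and shells out to the stock "ceph --admin-daemon ... dump_ops_in_flight" command.

#!/usr/bin/env python
# Ask every local OSD admin socket for its in-flight ops and print
# the oldest few per daemon.
import glob
import json
import subprocess

def ops_in_flight(sock):
    out = subprocess.check_output(
        ['ceph', '--admin-daemon', sock, 'dump_ops_in_flight'])
    return json.loads(out.decode('utf-8'))

def main():
    for sock in sorted(glob.glob('/var/run/ceph/ceph-osd.*.asok')):
        try:
            data = ops_in_flight(sock)
        except (subprocess.CalledProcessError, OSError, ValueError):
            continue  # daemon not responding, or output wasn't JSON
        ops = sorted(data.get('ops', []),
                     key=lambda op: float(op.get('age', 0)), reverse=True)
        print('%s: %d ops in flight' % (sock, data.get('num_ops', len(ops))))
        for op in ops[:3]:
            print('  age=%.1fs %s'
                  % (float(op.get('age', 0)), op.get('description', '?')))

if __name__ == '__main__':
    main()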
Cheers,
Florian