Significantly increased CPU footprint on OSDs after Hammer -> Jewel upgrade, OSDs occasionally wrongly marked as down

Hi,

We have two Ceph clusters: one exposing pools for both RGW and RBD (OpenStack/KVM), and one for RBD only.

After upgrading both to Jewel, we have seen a significantly increased CPU footprint on the OSDs in the cluster that includes RGW.

This graph illustrates the increase: http://i.imgur.com/Z81LW5Y.png
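
In case anyone wants to compare per-daemon numbers rather than host-level graphs, the counters can also be pulled from the OSD admin socket (osd.19 below is just an example id):

ceph daemon osd.19 perf dump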


I wonder if anyone else has seen this behaviour, and whether it is a symptom of a regression, or simply to be expected after moving from Hammer to Jewel.

I have also observed that an OSD will occasionally be marked as down, but will recover by itself.

This manifests itself in the OSD logs as a series of lines like this:

2016-10-26 06:32:20.106602 7fa57a942700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7fa575938700' had timed out after 15
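
If I read this correctly, the "15" is simply the default osd_op_thread_timeout (15 seconds), i.e. an op worker thread that has not reported progress for that long. The effective values can be checked on the admin socket (osd.19 again being just an example):

ceph daemon osd.19 config get osd_op_thread_timeout
ceph daemon osd.19 config get osd_heartbeat_grace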

Some slow requests are also observed:

2016-10-26 06:32:35.899597 7fa5aa41b700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.905777 secs
2016-10-26 06:32:35.899605 7fa5aa41b700  0 log_channel(cluster) log [WRN] : slow request 30.905777 seconds old, received at 2016-10-26 06:32:04.993791: replica scrub(pg: 3.2e,from:0'0,to:27810'772752,epoch:28538,start:3:74000000::::head,end:3:7400039b::::0,chunky:1,deep:1,seed:4294967295,version:6) currently reached_pg
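
Since the blocked request in that example is a replica deep scrub, one experiment I am considering (as a way to narrow things down, not as a fix) is to temporarily disable deep scrubbing and see whether the CPU usage and the flapping follow it:

ceph osd set nodeep-scrub
# observe for a while, then
ceph osd unset nodeep-scrub

If that does point at scrubbing, throttling it via osd_scrub_sleep would be the next thing to try.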

Some failing heartbeat_checks (usually only involving a single OSD):

2016-10-26 06:32:39.323412 7fa56f92c700 -1 osd.19 28538 heartbeat_check: no reply from osd.15 since back 2016-10-26 06:32:19.017249 front 2016-10-26 06:32:19.017249 (cutoff 2016-10-26 06:32:19.323409)
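
The cutoff there is 20 seconds, which matches the default osd_heartbeat_grace. As a stopgap while investigating, raising the grace period slightly is tempting, e.g. something along these lines (my understanding is that it also has to be injected on the mons for failure reports to be judged with the new value; corrections welcome):

ceph tell osd.* injectargs '--osd_heartbeat_grace 30'
ceph tell mon.* injectargs '--osd_heartbeat_grace 30'

That would of course only paper over the symptom, not explain the increased CPU usage.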


A bunch of these (with the remote address targeting different OSDs):

2016-10-26 06:32:45.522391 7fa598ec0700  0 -- 169.254.169.254:6812/151031797 >> 169.254.169.255:6802/41700 pipe(0x7fa5ebba7400 sd=160 :6812 s=2 pgs=4298 cs=1 l=0 c=0x7fa5d7c26400).fault with nothing to send, going to standby

2016-10-26 06:32:45.525524 7fa5a5158700  0 log_channel(cluster) log [WRN] : map e28540 wrongly marked me down

This is followed by re-peering, and then everything is fine again.
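
If it helps with diagnosing this, the slowest recent ops can be dumped from the affected OSD's admin socket right after it rejoins, to see what it was actually stuck on (osd.19 again just as an example):

ceph daemon osd.19 dump_historic_ops
ceph daemon osd.19 dump_ops_in_flight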



I wonder if anyone else has been suffering from similar behaviour, and whether this is a bug (known or unknown). One detail to keep in mind is that the OSDs for the RGW pools store replicas on different physical sites. However, we have no reason to believe that saturation or high latency is a problem.
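
A simple sanity check from Ceph's side is the per-OSD commit/apply latencies:

ceph osd perf

If anyone has a better way of ruling out the inter-site links, suggestions are very welcome.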



Regards
-- 
Trygve Vea
Redpill Linpro AS
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


