Ceph outage - monitoring options

Hi All,
  We have a jewel cluster (10.2.1) that we built up in a POC state (2 client hosts that also act as mons, 12 SSD OSDs on 3 hosts, 20 SATA OSDs on 3 hosts).   We have since connected our "prod" environment to it and migrated all the OSDs across, so it is now 114 OSDs (36 SSD, 78 NL-SAS, with another 26 waiting on the replacement of 2 DOA journals).

The migration was relatively clean, except for when I removed the old hosts from the crush map - even though their OSDs were already out and removed, the host entries still had weight, and the rebalancing that kicked in when the buckets were removed caused some sort of lockup with the kernel RBD clients.  Once those clients were kicked, things settled down and all was good.
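In hindsight, I suspect draining the crush weight before deleting the buckets would have avoided the surprise data movement. Roughly something like this (just a sketch - the OSD id and host bucket name are illustrative):

ceph osd crush reweight osd.12 0        # drain data off each OSD under the old host first
ceph osd crush remove osd.12            # then take the now-weightless OSD out of the crush map
ceph osd crush remove old-poc-sata-01   # finally drop the empty host bucket - no weight left, so no rebalance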

We then had one SAS OSD machine suffer a single 8-second network flap where it lost all network (which I didn't know until later).   By the time I was looking at the issue it was a ghostly silence - no OSDs down, none out, no failed PGs, no repairing - just blocked requests and nothing else.   After jumping between multiple machines and finding nothing of note I got desperate and restarted all OSDs on a host-by-host basis.  That resulted in a number of PGs becoming inactive, but there was no recovery or any other improvement, and the blocked requests continued.
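For reference, the restarts were basically the following (assuming systemd-managed jewel OSDs; the id is illustrative):

systemctl restart ceph-osd@12.service   # a single OSD
systemctl restart ceph-osd.target       # or all OSDs on the host at once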

I tried digging blocked requests out of random OSDs - most just said "waiting for peered" with no other real information.  After a while I started seeing a general pattern pointing to one machine as a peer for some of the affected PGs (I wish I had run ceph health detail earlier, because it would have clued me into the list of PGs owned by that host much faster).  Finding nothing wrong, and having already restarted its OSDs (the monitors saw them go down and come back up fine, but the blocked requests remained), I chose to reboot the server.
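For anyone who hits something similar, these are the commands that would have narrowed it down faster (a sketch; osd.98 is just an example id taken from the logs below):

ceph health detail                      # lists the stuck PGs and the OSDs they map to
ceph pg dump_stuck unclean              # shows PGs stuck in a non-clean state and their acting sets
ceph daemon osd.98 dump_ops_in_flight   # on the OSD host - shows what each blocked op is waiting for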

As soon as that was done it was like the cluster jumped back to life, discovered it was broken and started repairing.  Once the down host was back online it rejoined and the cluster quickly recovered, as if nothing had happened.

When I later dug into the logs I found the network flap, followed by the blocked requests building up and most of the OSDs on that host complaining of heartbeat_check failures:

Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic 0000:0a:00.0 enp10s0: Link DOWN
Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic 0000:0b:00.0 enp11s0: Link DOWN
Nov 21 16:38:31 ceph-glb-sata-06 kernel: enic 0000:0c:00.0 enp12s0: Link DOWN
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic 0000:0a:00.0 enp10s0: Link UP
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic 0000:0b:00.0 enp11s0: Link UP
Nov 21 16:38:39 ceph-glb-sata-06 kernel: enic 0000:0c:00.0 enp12s0: Link UP

Nov 21 16:38:45 ceph-glb-sata-06 ceph-osd: 2016-11-21 16:38:45.900016 7f66a918c700 -1 osd.103 69295 heartbeat_check: no reply from osd.68 since back 2016-11-21 16:38:25.707045 front 2016-11-21 16:38:43.908106 (cutoff 2016-11-21 16:38:25.900010)
Nov 21 16:38:45 ceph-glb-sata-06 ceph-osd: 2016-11-21 16:38:45.900038 7f66a918c700 -1 osd.103 69295 heartbeat_check: no reply from osd.70 since back 2016-11-21 16:38:25.707045 front 2016-11-21 16:38:43.908106 (cutoff 2016-11-21 16:38:25.900010)

2016-11-21 16:38:49.751243 7f138a653700 -1 osd.98 69295 heartbeat_check: no reply from osd.74 since back 2016-11-21 16:38:29.391091 front 2016-11-21 16:38:47.293162 (cutoff 2016-11-21 16:38:29.751238)
2016-11-21 16:38:49.751264 7f138a653700 -1 osd.98 69295 heartbeat_check: no reply from osd.77 since back 2016-11-21 16:38:29.391091 front 2016-11-21 16:38:47.293162 (cutoff 2016-11-21 16:38:29.751238)
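
The cutoff in those messages looks like "now minus osd heartbeat grace" (20 seconds by default, which matches the timestamps above), i.e. the OSDs on that host had gone longer than the grace period without a back-channel heartbeat reply from those peers. The effective values can be checked via the admin socket (a sketch; osd.98 again is just the example id):

ceph daemon osd.98 config show | grep -E 'osd_heartbeat_(grace|interval)'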

In some of the blocked ops I found references to waiting on rw locks:

2016-11-21 16:43:42.787670 7f138a653700  0 log_channel(cluster) log [WRN] : 4 slow requests, 4 included below; oldest blocked for > 120.843670 secs
2016-11-21 16:43:42.787673 7f138a653700  0 log_channel(cluster) do_log log to syslog
2016-11-21 16:43:42.787707 7f138a653700  0 log_channel(cluster) log [WRN] : slow request 120.843670 seconds old, received at 2016-11-21 16:41:41.943943: osd_op(client.2714100.1:69471470 1.3140a3e7 rbd_data.159a26238e1f29.0000000000018502 [set-alloc-hint object_size 4194304 write_size 4194304,write 3683840~510464] snapc 0=[] ondisk+write+ignore_cache+ignore_overlay+known_if_redirected e69295) currently waiting for subops from 89
2016-11-21 16:43:42.787710 7f138a653700  0 log_channel(cluster) do_log log to syslog
2016-11-21 16:43:42.787725 7f138a653700  0 log_channel(cluster) log [WRN] : slow request 60.181839 seconds old, received at 2016-11-21 16:42:42.605773: osd_op(client.2714100.1:69528164 1.3140a3e7 rbd_data.159a26238e1f29.0000000000018502 [set-alloc-hint object_size 4194304 write_size 4194304,write 3683840~510464] snapc 0=[] ondisk+write+ignore_cache+ignore_overlay+known_if_redirected e69295) currently waiting for subops from 89
2016-11-21 16:43:42.787740 7f138a653700  0 log_channel(cluster) do_log log to syslog
2016-11-21 16:43:42.787811 7f138a653700  0 log_channel(cluster) log [WRN] : slow request 60.181151 seconds old, received at 2016-11-21 16:42:42.606462: osd_op(osd.22.51681:2352097 1.3140a3e7 rbd_data.159a26238e1f29.0000000000018502 [list-snaps] snapc 0=[] ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e69295) currently waiting for rw locks
2016-11-21 16:43:42.787813 7f138a653700  0 log_channel(cluster) do_log log to syslog
2016-11-21 16:43:42.787875 7f138a653700  0 log_channel(cluster) log [WRN] : slow request 60.181123 seconds old, received at 2016-11-21 16:42:42.606490: osd_op(osd.22.51681:2352098 1.3140a3e7 rbd_data.159a26238e1f29.0000000000018502 [copy-get max 8388608] snapc 0=[] ack+read+rwordered+ignore_cache+ignore_overlay+map_snap_clone+known_if_redirected e69295) currently waiting for rw locks
2016-11-21 16:43:42.787877 7f138a653700  0 log_channel(cluster) do_log log to syslog

I am assuming there is something underlying here that broke network connectivity; however, at the time I was able to connect to the machine fine, restart the services and have the monitors see that activity, and ping across both the replication and public networks.   Granted, it's probably not Ceph's fault that connectivity broke. However....
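For what it's worth, the connectivity checks I did amounted to roughly this (a sketch; the osd id, interface and peer address are illustrative):

ceph osd dump | grep '^osd.103'              # shows the public/cluster addresses and up/in state for that OSD
ping -c 3 -I enp11s0 <peer cluster address>  # basic reachability of a peer over the replication network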

Why, when the OSDs were restarted, would they go straight back into that failed state, still blocked, even though they came back up and in?
Why would other OSDs not have marked them as bad or unresponsive at that point, given many were waiting on blocked operations?  Is the fact that they could connect enough, or are there other conditions that would prevent them being marked out by their peers?
Is there any recommended configuration that would help detect an unresponsive OSD and more quickly mark it out of the osdmap to keep the cluster operational (timeout tuning, flags to set, etc.)?
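For concreteness, these are the knobs I assume are relevant - the values are only placeholders to show what I mean, not tested recommendations:

[mon]
    # require reports from more than one OSD before marking a peer down
    mon osd min down reporters = 2
    # how long a down OSD is left before being automatically marked out
    mon osd down out interval = 300
[osd]
    # how long an OSD waits for heartbeat replies before reporting a peer as failed
    osd heartbeat grace = 20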

Any help appreciated.  It's a little scary kicking into production and having an outage where I can't explain why Ceph's redundancy didn't kick in.

Cheers,
 Adrian





_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


