Cluster in bad shape, seemingly endless cycle of OSDs failing, being marked down, booting, then failing again

Hi all,

 

We’re still very new to managing Ceph and seem to have a cluster stuck in an endless loop of OSDs failing, being marked down, then booting, then failing again.

 

Here are some example logs:

2018-07-17 16:48:28.976673 mon.rook-ceph-mon7 [INF] osd.83 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491973 >= grace 20.010293)

2018-07-17 16:48:28.976730 mon.rook-ceph-mon7 [INF] osd.84 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491916 >= grace 20.010293)

2018-07-17 16:48:28.976785 mon.rook-ceph-mon7 [INF] osd.85 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491870 >= grace 20.011151)

2018-07-17 16:48:28.976843 mon.rook-ceph-mon7 [INF] osd.86 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 61.491828 >= grace 20.010293)

2018-07-17 16:48:28.976890 mon.rook-ceph-mon7 [INF] Marking osd.1 out (has been down for 605 seconds)

2018-07-17 16:48:28.976913 mon.rook-ceph-mon7 [INF] Marking osd.2 out (has been down for 605 seconds)

2018-07-17 16:48:28.976933 mon.rook-ceph-mon7 [INF] Marking osd.3 out (has been down for 605 seconds)

2018-07-17 16:48:28.976954 mon.rook-ceph-mon7 [INF] Marking osd.4 out (has been down for 605 seconds)

2018-07-17 16:48:28.976979 mon.rook-ceph-mon7 [INF] Marking osd.9 out (has been down for 605 seconds)

2018-07-17 16:48:28.977000 mon.rook-ceph-mon7 [INF] Marking osd.10 out (has been down for 605 seconds)

2018-07-17 16:48:28.977020 mon.rook-ceph-mon7 [INF] Marking osd.11 out (has been down for 605 seconds)

2018-07-17 16:48:28.977040 mon.rook-ceph-mon7 [INF] Marking osd.12 out (has been down for 605 seconds)

2018-07-17 16:48:28.977059 mon.rook-ceph-mon7 [INF] Marking osd.13 out (has been down for 605 seconds)

2018-07-17 16:48:28.977079 mon.rook-ceph-mon7 [INF] Marking osd.14 out (has been down for 605 seconds)

2018-07-17 16:48:30.889316 mon.rook-ceph-mon7 [INF] osd.55 7.129.218.12:6920/90761 boot

2018-07-17 16:48:31.113052 mon.rook-ceph-mon7 [WRN] Health check update: 4946/8854434 objects misplaced (0.056%) (OBJECT_MISPLACED)

2018-07-17 16:48:31.113087 mon.rook-ceph-mon7 [WRN] Health check update: Degraded data redundancy: 7951/8854434 objects degraded (0.090%), 88 pgs degraded, 273 pgs undersized (PG_DEGRADED)

2018-07-17 16:48:32.763546 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10439 pgs inactive, 8994 pgs down, 1639 pgs peering, 88 pgs incomplete, 3430 pgs stale (PG_AVAILABILITY)

2018-07-17 16:48:32.763578 mon.rook-ceph-mon7 [WRN] Health check update: 29 slow requests are blocked > 32 sec (REQUEST_SLOW)

2018-07-17 16:48:34.096178 mon.rook-ceph-mon7 [INF] osd.88 failed (root=default,host=carg-kubelet-osd04) (3 reporters from different host after 66.612054 >= grace 20.010283)

2018-07-17 16:48:34.108020 mon.rook-ceph-mon7 [WRN] Health check update: 112 osds down (OSD_DOWN)

2018-07-17 16:48:38.736108 mon.rook-ceph-mon7 [WRN] Health check update: 4946/8843715 objects misplaced (0.056%) (OBJECT_MISPLACED)

2018-07-17 16:48:38.736140 mon.rook-ceph-mon7 [WRN] Health check update: Reduced data availability: 10415 pgs inactive, 9000 pgs down, 1635 pgs peering, 88 pgs incomplete, 3418 pgs stale (PG_AVAILABILITY)

2018-07-17 16:48:38.736166 mon.rook-ceph-mon7 [WRN] Health check update: Degraded data redundancy: 7949/8843715 objects degraded (0.090%), 86 pgs degraded, 267 pgs undersized (PG_DEGRADED)

2018-07-17 16:48:40.430146 mon.rook-ceph-mon7 [WRN] Health check update: 111 osds down (OSD_DOWN)

2018-07-17 16:48:40.812579 mon.rook-ceph-mon7 [INF] osd.117 7.129.217.10:6833/98090 boot

2018-07-17 16:48:42.427204 mon.rook-ceph-mon7 [INF] osd.115 7.129.217.10:6940/98114 boot

2018-07-17 16:48:42.427297 mon.rook-ceph-mon7 [INF] osd.100 7.129.217.10:6899/98091 boot

2018-07-17 16:48:42.427502 mon.rook-ceph-mon7 [INF] osd.95 7.129.217.10:6901/98092 boot
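
If we’re reading the "3 reporters from different host after ... >= grace 20" lines correctly, other OSDs are reporting that heartbeats from these daemons stopped arriving for longer than the (default) 20-second osd_heartbeat_grace, and the "has been down for 605 seconds" lines match the default mon_osd_down_out_interval of 600 seconds. So our working theory is that the daemons themselves keep coming back fine, but their heartbeats are getting lost somewhere (network between the OSD hosts, or hosts that are overloaded). Assuming that’s the right way to read it, we were planning to start with something like this (osd.83 below is just one of the flapping OSDs from the log above; the config get has to be run against that OSD’s admin socket, i.e. from inside its pod/host in our Rook setup):

    ceph health detail                                  # which OSDs/PGs are affected right now
    ceph osd tree | grep -w down                        # which hosts the down OSDs are sitting on
    ceph daemon osd.83 config get osd_heartbeat_grace   # confirm the grace value we keep hitting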

 

We’re not sure this is going to fix itself. Any ideas on how to handle this situation?
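
In particular, would it make sense to temporarily set the noout and nodown flags to stop the down/out/boot churn while we chase the heartbeat problem, roughly:

    ceph osd set noout      # stop down OSDs from being marked out (and the resulting rebalancing)
    ceph osd set nodown     # stop the flapping OSDs from being marked down
    # ...investigate the network / heartbeats between the OSD hosts, then:
    ceph osd unset nodown
    ceph osd unset noout

or is there a better way to stabilize things first?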

 

Thanks in advance!

-Bryan

 




_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
