Hi, I have been experiencing brief outage events in my Ceph cluster. During these events I get slow-ops warnings and OSDs are marked down, even though the daemons are in fact still running, and then everything suddenly returns to normal on its own. Each event lasts about 2 minutes. I haven't found anything that points to the cause. Can you help me?

I'm running Mimic (13.2.6) on CentOS 7 on all nodes. Below are the output of "ceph -s" and the cluster log from one of these events.

# ceph -s
  cluster:
    id:     4ea72929-6f9e-453a-8cd5-bb0712f6b874
    health: HEALTH_OK

  services:
    mon:         2 daemons, quorum cmonitor,cmonitor2
    mgr:         cmonitor(active), standbys: cmonitor2
    osd:         74 osds: 74 up, 74 in
    tcmu-runner: 10 daemons active

  data:
    pools:   7 pools, 3072 pgs
    objects: 22.17 M objects, 83 TiB
    usage:   225 TiB used, 203 TiB / 428 TiB avail
    pgs:     3063 active+clean
             9    active+clean+scrubbing+deep
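In case it is useful, this is roughly what I plan to run while the next event is in progress, to capture more detail (osd.16 below is just one of the OSDs named in the slow-ops warning, and the "ceph daemon" commands have to be run on the host where that daemon lives):

# which health checks are firing, in detail
ceph health detail

# per-OSD commit/apply latency, to spot a single slow disk
ceph osd perf

# in-flight and recently completed ops on one affected OSD
ceph daemon osd.16 ops
ceph daemon osd.16 dump_historic_ops

# recent cluster log entries
ceph log last 50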
======================================
Log of event:

2020-08-05 18:00:00.000179 [INF] overall HEALTH_OK
2020-08-05 17:55:28.905024 [INF] Cluster is now healthy
2020-08-05 17:55:28.904975 [INF] Health check cleared: PG_DEGRADED (was: Degraded data redundancy: 1/60350974 objects degraded (0.000%), 1 pg degraded)
2020-08-05 17:55:27.746606 [WRN] Health check update: Degraded data redundancy: 1/60350974 objects degraded (0.000%), 1 pg degraded (PG_DEGRADED)
2020-08-05 17:55:22.745820 [WRN] Health check update: Degraded data redundancy: 55/60350897 objects degraded (0.000%), 26 pgs degraded, 1 pg undersized (PG_DEGRADED)
2020-08-05 17:55:17.744218 [WRN] Health check update: Degraded data redundancy: 123/60350666 objects degraded (0.000%), 63 pgs degraded (PG_DEGRADED)
2020-08-05 17:55:12.743568 [WRN] Health check update: Degraded data redundancy: 192/60350660 objects degraded (0.000%), 88 pgs degraded (PG_DEGRADED)
2020-08-05 17:55:07.741759 [WRN] Health check update: Degraded data redundancy: 290/60350737 objects degraded (0.000%), 117 pgs degraded (PG_DEGRADED)
2020-08-05 17:55:02.737913 [WRN] Health check update: Degraded data redundancy: 299/60350764 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:57.736694 [WRN] Health check update: Degraded data redundancy: 299/60350746 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:52.736132 [WRN] Health check update: Degraded data redundancy: 299/60350731 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:47.735612 [WRN] Health check update: Degraded data redundancy: 299/60350689 objects degraded (0.000%), 119 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:42.734877 [WRN] Health check update: Degraded data redundancy: 301/60350677 objects degraded (0.000%), 120 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:38.210906 [INF] Health check cleared: SLOW_OPS (was: 35 slow ops, oldest one blocked for 1954017 sec, daemons [mon.cmonitor,mon.cmonitor2] have slow ops.)
2020-08-05 17:54:37.734218 [WRN] Health check update: 35 slow ops, oldest one blocked for 1954017 sec, daemons [mon.cmonitor,mon.cmonitor2] have slow ops. (SLOW_OPS)
2020-08-05 17:54:37.734132 [WRN] Health check update: Degraded data redundancy: 380/60350611 objects degraded (0.001%), 154 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:34.171483 [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 3 pgs inactive, 6 pgs peering)
2020-08-05 17:54:32.733499 [WRN] Health check update: Degraded data redundancy: 52121/60350544 objects degraded (0.086%), 211 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:27.080529 [WRN] Monitor daemon marked osd.72 down, but it is still running
2020-08-05 17:54:32.102889 [WRN] Health check failed: 60 slow ops, oldest one blocked for 1954017 sec, daemons [osd.16,osd.22,osd.23,osd.27,osd.28,osd.29,osd.30,osd.35,osd.48,osd.5]... have slow ops. (SLOW_OPS)
2020-08-05 17:54:32.102699 [WRN] Health check update: Reduced data availability: 3 pgs inactive, 6 pgs peering (PG_AVAILABILITY)
2020-08-05 17:54:27.951343 [INF] osd.72 192.168.200.25:6844/64565 boot
2020-08-05 17:54:27.935996 [INF] Health check cleared: OSD_DOWN (was: 1 osds down)
2020-08-05 17:54:27.732679 [WRN] Health check update: Degraded data redundancy: 1781748/60350443 objects degraded (2.952%), 381 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:26.916269 [WRN] Health check update: Reduced data availability: 5 pgs inactive, 10 pgs peering, 18 pgs incomplete (PG_AVAILABILITY)
2020-08-05 17:54:19.712714 [WRN] Monitor daemon marked osd.71 down, but it is still running
2020-08-05 17:54:22.716043 [WRN] Health check update: Degraded data redundancy: 4057630/60350485 objects degraded (6.723%), 566 pgs degraded (PG_DEGRADED)
2020-08-05 17:54:22.715939 [WRN] Health check update: 1 osds down (OSD_DOWN)
2020-08-05 17:54:14.042363 [WRN] Monitor daemon marked osd.62 down, but it is still running
2020-08-05 17:54:20.858582 [WRN] Health check update: Reduced data availability: 8 pgs inactive, 15 pgs peering, 37 pgs incomplete (PG_AVAILABILITY)
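Since the OSDs were marked down while still running, I also intend to look for missed heartbeats in the OSD logs on the affected hosts and to rule out the network. A rough sketch of what I'll run (the log path is the CentOS 7 default, and eth0 is just a placeholder for our cluster/public interface):

# heartbeat failures reported by one of the flapping OSDs around the event window
grep heartbeat_check /var/log/ceph/ceph-osd.72.log

# interface error/drop counters on the OSD hosts
ip -s link show
ethtool -S eth0 | grep -iE 'drop|err'

Any pointers on what else to capture would be appreciated.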