Hi,

I have a problem with crashing OSD daemons in our Ceph 15.2.6 cluster. The problem was temporarily resolved by disabling scrub and deep-scrub (see the note after the log excerpt below for the scrub flag commands). All PGs are active+clean. After a few days I tried to enable scrubbing again, but the problem persists: OSDs show high latencies, PGs go laggy, OSDs stop responding and are marked down, and the cluster is not usable. The problem first appeared after a failover during which 10 OSDs were marked out.

I would appreciate any help or advice on how to resolve this problem.

Regards,
Miroslav

2020-12-10T21:39:57.721883+0100 osd.1 (osd.1) 5 : cluster [DBG] 18.11 deep-scrub starts
2020-12-10T21:39:57.880861+0100 osd.1 (osd.1) 6 : cluster [DBG] 18.11 deep-scrub ok
2020-12-10T21:39:58.713422+0100 osd.1 (osd.1) 7 : cluster [DBG] 43.5 scrub starts
2020-12-10T21:39:58.719372+0100 osd.1 (osd.1) 8 : cluster [DBG] 43.5 scrub ok
2020-12-10T21:39:59.296962+0100 mgr.pve2-prg2a (mgr.91575377) 118746 : cluster [DBG] pgmap v119000: 1737 pgs: 3 active+clean+laggy, 3 active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 117 KiB/s rd, 2.5 MiB/s wr, 269 op/s
2020-12-10T21:40:00.088421+0100 osd.29 (osd.29) 74 : cluster [DBG] 1.13b deep-scrub starts
2020-12-10T21:40:01.300373+0100 mgr.pve2-prg2a (mgr.91575377) 118747 : cluster [DBG] pgmap v119001: 1737 pgs: 3 active+clean+laggy, 3 active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 101 KiB/s rd, 1.9 MiB/s wr, 202 op/s
2020-12-10T21:40:02.681058+0100 osd.34 (osd.34) 13 : cluster [DBG] 1.a2 deep-scrub ok
2020-12-10T21:40:03.304009+0100 mgr.pve2-prg2a (mgr.91575377) 118749 : cluster [DBG] pgmap v119002: 1737 pgs: 3 active+clean+laggy, 3 active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 101 KiB/s rd, 1.9 MiB/s wr, 198 op/s
2020-12-10T21:40:05.316233+0100 mgr.pve2-prg2a (mgr.91575377) 118750 : cluster [DBG] pgmap v119003: 1737 pgs: 6 active+clean+laggy, 3 active+clean+scrubbing+deep, 1728 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 150 KiB/s rd, 3.0 MiB/s wr, 249 op/s
2020-12-10T21:40:07.319643+0100 mgr.pve2-prg2a (mgr.91575377) 118751 : cluster [DBG] pgmap v119004: 1737 pgs: 6 active+clean+laggy, 3 active+clean+scrubbing+deep, 1728 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 142 KiB/s rd, 2.3 MiB/s wr, 212 op/s
2020-12-10T21:40:15.523134+0100 mon.pve1-prg2a (mon.0) 125943 : cluster [DBG] osd.4 reported failed by osd.24
2020-12-10T21:40:15.523325+0100 mon.pve1-prg2a (mon.0) 125944 : cluster [DBG] osd.39 reported failed by osd.24
2020-12-10T21:40:16.112299+0100 mon.pve1-prg2a (mon.0) 125946 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 32 sec, osd.8 has slow ops (SLOW_OPS)
2020-12-10T21:40:16.202867+0100 mon.pve1-prg2a (mon.0) 125947 : cluster [DBG] osd.4 reported failed by osd.34
2020-12-10T21:40:16.202986+0100 mon.pve1-prg2a (mon.0) 125948 : cluster [INF] osd.4 failed (root=default,host=pve1-prg2a) (2 reporters from different host after 24.000267 >= grace 22.361677)
2020-12-10T21:40:16.373925+0100 mon.pve1-prg2a (mon.0) 125949 : cluster [DBG] osd.39 reported failed by osd.6
2020-12-10T21:40:16.865608+0100 mon.pve1-prg2a (mon.0) 125951 : cluster [DBG] osd.39 reported failed by osd.8
2020-12-10T21:40:17.125917+0100 mon.pve1-prg2a (mon.0) 125952 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
2020-12-10T21:40:17.139006+0100 mon.pve1-prg2a (mon.0) 125953 : cluster [DBG] osdmap e12819: 40 total, 39 up, 30 in
2020-12-10T21:40:17.140248+0100 mon.pve1-prg2a (mon.0) 125954 : cluster [DBG] osd.39 reported failed by osd.21
2020-12-10T21:40:17.344244+0100 mgr.pve2-prg2a (mgr.91575377) 118757 : cluster [DBG] pgmap v119012: 1737 pgs: 9 peering, 61 stale+active+clean, 1 active+clean+scrubbing+deep, 7 active+clean+laggy, 1659 active+clean; 3.3 TiB data, 7.8 TiB used, 14 TiB / 22 TiB avail; 44 KiB/s rd, 2.5 MiB/s wr, 107 op/s
2020-12-10T21:40:17.378069+0100 mon.pve1-prg2a (mon.0) 125955 : cluster [DBG] osd.39 reported failed by osd.26
2020-12-10T21:40:17.424429+0100 mon.pve1-prg2a (mon.0) 125956 : cluster [DBG] osd.39 reported failed by osd.18
2020-12-10T21:40:17.829447+0100 mon.pve1-prg2a (mon.0) 125957 : cluster [DBG] osd.39 reported failed by osd.36
2020-12-10T21:40:17.847373+0100 mon.pve1-prg2a (mon.0) 125958 : cluster [DBG] osd.39 reported failed by osd.1
2020-12-10T21:40:17.858371+0100 mon.pve1-prg2a (mon.0) 125959 : cluster [DBG] osd.39 reported failed by osd.17
2020-12-10T21:40:17.915755+0100 mon.pve1-prg2a (mon.0) 125960 : cluster [DBG] osd.39 reported failed by osd.28
2020-12-10T21:40:24.151192+0100 mon.pve1-prg2a (mon.0) 125986 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
2020-12-10T21:40:24.608038+0100 mon.pve1-prg2a (mon.0) 125987 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 37 sec, osd.8 has slow ops (SLOW_OPS)
2020-12-10T21:40:25.375322+0100 mgr.pve2-prg2a (mgr.91575377) 118761 : cluster [DBG] pgmap v119017: 1737 pgs: 1 active+clean+scrubbing+deep, 6 active+undersized, 40 active+undersized+wait, 13 active+undersized+degraded, 101 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 4.0 MiB/s wr, 179 op/s; 74043/2676420 objects degraded (2.766%); 622 KiB/s, 0 objects/s recovering
2020-12-10T21:40:27.379311+0100 mgr.pve2-prg2a (mgr.91575377) 118762 : cluster [DBG] pgmap v119018: 1737 pgs: 1 active+clean+scrubbing+deep, 6 active+undersized, 40 active+undersized+wait, 13 active+undersized+degraded, 101 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 3.6 MiB/s wr, 154 op/s; 74043/2676420 objects degraded (2.766%); 510 KiB/s, 0 objects/s recovering
2020-12-10T21:40:29.391905+0100 mgr.pve2-prg2a (mgr.91575377) 118763 : cluster [DBG] pgmap v119019: 1737 pgs: 1 active+clean+scrubbing+deep, 11 active+undersized, 35 active+undersized+wait, 21 active+undersized+degraded, 93 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 3.9 MiB/s wr, 170 op/s; 74043/2676420 objects degraded (2.766%); 456 KiB/s, 0 objects/s recovering
2020-12-10T21:40:29.609801+0100 mon.pve1-prg2a (mon.0) 125991 : cluster [WRN] Health check update: Degraded data redundancy: 74043/2676420 objects degraded (2.766%), 114 pgs degraded (PG_DEGRADED)
2020-12-10T21:40:29.609845+0100 mon.pve1-prg2a (mon.0) 125992 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 42 sec, osd.8 has slow ops (SLOW_OPS)
2020-12-10T21:40:31.395715+0100 mgr.pve2-prg2a (mgr.91575377) 118764 : cluster [DBG] pgmap v119020: 1737 pgs: 1 active+clean+scrubbing+deep, 11 active+undersized, 35 active+undersized+wait, 21 active+undersized+degraded, 93 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 154 KiB/s rd, 2.1 MiB/s wr, 151 op/s; 74043/2676420 objects degraded (2.766%); 425 KiB/s, 0 objects/s recovering
2020-12-10T21:40:32.797773+0100 osd.8 (osd.8) 19 : cluster [WRN] slow request osd_op(client.91721949.0:169935 1.3b 1:dc12cc94:::rbd_header.0373d6a5f7de48:head [watch ping cookie 139741691437056] snapc 0=[] ondisk+write+known_if_redirected e12816) initiated 2020-12-10T21:39:59.813054+0100 currently delayed
2020-12-10T21:40:33.399277+0100 mgr.pve2-prg2a (mgr.91575377) 118765 : cluster [DBG] pgmap v119021: 1737 pgs: 1 active+clean+scrubbing+deep, 13 active+undersized, 33 active+undersized+wait, 28 active+undersized+degraded, 86 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 156 KiB/s rd, 1.6 MiB/s wr, 144 op/s; 74043/2676420 objects degraded (2.766%)
2020-12-10T21:40:33.841597+0100 osd.8 (osd.8) 20 : cluster [WRN] slow request osd_op(client.91721949.0:169935 1.3b 1:dc12cc94:::rbd_header.0373d6a5f7de48:head [watch ping cookie 139741691437056] snapc 0=[] ondisk+write+known_if_redirected e12816) initiated 2020-12-10T21:39:59.813054+0100 currently delayed
2020-12-10T21:40:34.611650+0100 mon.pve1-prg2a (mon.0) 125996 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 47 sec, daemons [osd.22,osd.8] have slow ops. (SLOW_OPS)
2020-12-10T21:40:34.612166+0100 mon.pve1-prg2a (mon.0) 125997 : cluster [INF] osd.39 failed (root=default,host=pve5-prg2a) (5 reporters from different host after 39.438374 >= grace 37.721344)
2020-12-10T21:40:34.615286+0100 mon.pve1-prg2a (mon.0) 125998 : cluster [WRN] Health check update: 2 osds down (OSD_DOWN)
2020-12-10T21:40:34.621801+0100 mon.pve1-prg2a (mon.0) 125999 : cluster [DBG] osdmap e12821: 40 total, 38 up, 30 in
2020-12-10T21:40:34.880626+0100 osd.8 (osd.8) 21 : cluster [WRN] slow request osd_op(client.91721949.0:169935 1.3b 1:dc12cc94:::rbd_header.0373d6a5f7de48:head [watch ping cookie 139741691437056] snapc 0=[] ondisk+write+known_if_redirected e12816) initiated 2020-12-10T21:39:59.813054+0100 currently delayed

8/0 sis=12819) [28,35] r=-1 lpr=12819 pi=[12507,12819)/1 crt=12818'17226265 lcod 12818'17226264 mlcod 0'0 unknown NOTIFY mbc={}] exit Start 0.000020 0 0.000000
-823> 2020-12-10T21:40:44.079+0100 7f5310751700 5 osd.4 pg_epoch: 12822 pg[1.cc( v 12818'17226265 (12812'17224439,12818'17226265] local-lis/les=12507/12508 n=1566 ec=171/171 lis/c=12507/12507 les/c/f=12508/12508/0 sis=12819) [28,35] r=-1 lpr=12819 pi=[12507,12819)/1 crt=12818'17226265 lcod 12818'17226264 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
-822> 2020-12-10T21:40:44.191+0100 7f5324f7a700 5 prioritycache tune_memory target: 4294967296 mapped: 3494428672 unmapped: 383787008 heap: 3878215680 old mem: 2845415832 new mem: 2845415832
-821> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had timed out after 15
-820> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5312f56700' had timed out after 15
-819> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 osd.4 12822 is_healthy false -- internal heartbeat failed
-818> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 osd.4 12822 not healthy; waiting to boot
-817> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 osd.4 12822 tick checking mon for new map
-816> 2020-12-10T21:40:44.311+0100 7f530e74d700 5 osd.4 12822 heartbeat osd_stat(store_statfs(0x7c5f5b0000/0x2e36c0000/0xba40000000, data 0x3af1c2f453/0x3afd380000, compress 0x0/0x0/0x0, omap 0x30256c, meta 0x2e33bda94), peers [3,5,6,7,8,9,13,20,28,36] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22])
-815> 2020-12-10T21:40:44.403+0100 7f5323337700 10 monclient: tick
-814> 2020-12-10T21:40:44.403+0100 7f5323337700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-12-10T21:40:14.409098+0100)
-813> 2020-12-10T21:40:44.403+0100 7f5323337700 10 log_client log_queue is 2 last_log 7 sent 5 num 2 unsent 2 sending 2
-812> 2020-12-10T21:40:44.403+0100 7f5323337700 10 log_client will send 2020-12-10T21:40:44.063147+0100 osd.4 (osd.4) 6 : cluster [WRN] Monitor daemon marked osd.4 down, but it is still running
-811> 2020-12-10T21:40:44.403+0100 7f5323337700 10 log_client will send 2020-12-10T21:40:44.063157+0100 osd.4 (osd.4) 7 : cluster [DBG] map e12822 wrongly marked me down at e12819
-810> 2020-12-10T21:40:44.403+0100 7f5323337700 10 monclient: _send_mon_message to mon.pve1-prg2a at v2:10.104.200.11:3300/0
-809> 2020-12-10T21:40:45.191+0100 7f5324f7a700 5 prioritycache tune_memory target: 4294967296 mapped: 3494428672 unmapped: 383787008 heap: 3878215680 old mem: 2845415832 new mem: 2845415832
-808> 2020-12-10T21:40:45.351+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had timed out after 15
-807> 2020-12-10T21:40:45.351+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5312f56700' had timed out after 15
-3> 2020-12-10T21:42:10.466+0100 7f5324f7a700 5 prioritycache tune_memory target: 4294967296 mapped: 3510689792 unmapped: 367525888 heap: 3878215680 old mem: 2845415832 new mem: 2845415832
-2> 2020-12-10T21:42:10.694+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had timed out after 15
-1> 2020-12-10T21:42:10.694+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had suicide timed out after 150
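For reference, the scrub flag commands mentioned above (a minimal sketch; I am assuming the standard cluster-wide noscrub/nodeep-scrub OSD flags, not any per-pool scrub settings):

  # disable new scrubs cluster-wide (this is what made the cluster stable again)
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # re-enable scrubbing by clearing the flags (shortly after this the problem returns)
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub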