Hi Igor,

thank you, you are right. It seems that the background removal has completed. Is the correct way to fix it to run "ceph-kvstore-tool bluestore-kv <path-to-osd> compact" on every OSD, one by one?
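In case it helps to confirm, the per-OSD sequence I have in mind is roughly the following (just a sketch on my side, assuming systemd-managed OSDs and the default /var/lib/ceph/osd/ceph-<id> data path; please correct me if any step is wrong or unnecessary):

    # keep the cluster from rebalancing while each OSD is briefly down
    ceph osd set noout

    # then, for each OSD id in turn (the OSD must be stopped before compacting offline):
    systemctl stop ceph-osd@<id>
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
    systemctl start ceph-osd@<id>
    # wait until "ceph -s" shows the PGs active+clean again before moving to the next OSD

    # once all OSDs are done
    ceph osd unset noout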
Regards,

Miroslav

On Fri, 11 Dec 2020 at 14:19, Igor Fedotov <ifedotov@xxxxxxx> wrote:

> Hi Miroslav,
>
> haven't you performed massive data removal (PG migration) recently?
>
> If so you might want to apply manual DB compaction to your OSDs.
>
> The positive effect might be just temporary if background removals are still in progress though..
>
> See https://tracker.ceph.com/issues/47044 and the references for more info on the issue.
>
> Thanks,
>
> Igor
>
> On 12/11/2020 12:50 PM, Miroslav Boháč wrote:
> > Hi,
> >
> > I have a problem with crashing OSD daemons in our Ceph 15.2.6 cluster. The problem was temporarily resolved by disabling scrub and deep-scrub. All PGs are active+clean. After a few days I tried to enable scrubbing again, but the problem persists: OSDs with high latencies, laggy PGs, OSDs not responding, OSDs marked down, and the cluster is not usable. The problem appeared after a failover when 10 OSDs were marked out.
> >
> > I would appreciate any help or advice on how to resolve this problem.
> >
> > Regards,
> > Miroslav
> >
> > 2020-12-10T21:39:57.721883+0100 osd.1 (osd.1) 5 : cluster [DBG] 18.11 deep-scrub starts
> > 2020-12-10T21:39:57.880861+0100 osd.1 (osd.1) 6 : cluster [DBG] 18.11 deep-scrub ok
> > 2020-12-10T21:39:58.713422+0100 osd.1 (osd.1) 7 : cluster [DBG] 43.5 scrub starts
> > 2020-12-10T21:39:58.719372+0100 osd.1 (osd.1) 8 : cluster [DBG] 43.5 scrub ok
> > 2020-12-10T21:39:59.296962+0100 mgr.pve2-prg2a (mgr.91575377) 118746 : cluster [DBG] pgmap v119000: 1737 pgs: 3 active+clean+laggy, 3 active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 117 KiB/s rd, 2.5 MiB/s wr, 269 op/s
> > 2020-12-10T21:40:00.088421+0100 osd.29 (osd.29) 74 : cluster [DBG] 1.13b deep-scrub starts
> > 2020-12-10T21:40:01.300373+0100 mgr.pve2-prg2a (mgr.91575377) 118747 : cluster [DBG] pgmap v119001: 1737 pgs: 3 active+clean+laggy, 3 active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 101 KiB/s rd, 1.9 MiB/s wr, 202 op/s
> > 2020-12-10T21:40:02.681058+0100 osd.34 (osd.34) 13 : cluster [DBG] 1.a2 deep-scrub ok
> > 2020-12-10T21:40:03.304009+0100 mgr.pve2-prg2a (mgr.91575377) 118749 : cluster [DBG] pgmap v119002: 1737 pgs: 3 active+clean+laggy, 3 active+clean+scrubbing+deep, 1731 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 101 KiB/s rd, 1.9 MiB/s wr, 198 op/s
> > 2020-12-10T21:40:05.316233+0100 mgr.pve2-prg2a (mgr.91575377) 118750 : cluster [DBG] pgmap v119003: 1737 pgs: 6 active+clean+laggy, 3 active+clean+scrubbing+deep, 1728 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 150 KiB/s rd, 3.0 MiB/s wr, 249 op/s
> > 2020-12-10T21:40:07.319643+0100 mgr.pve2-prg2a (mgr.91575377) 118751 : cluster [DBG] pgmap v119004: 1737 pgs: 6 active+clean+laggy, 3 active+clean+scrubbing+deep, 1728 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 142 KiB/s rd, 2.3 MiB/s wr, 212 op/s
> >
> > 2020-12-10T21:40:15.523134+0100 mon.pve1-prg2a (mon.0) 125943 : cluster [DBG] osd.4 reported failed by osd.24
> > 2020-12-10T21:40:15.523325+0100 mon.pve1-prg2a (mon.0) 125944 : cluster [DBG] osd.39 reported failed by osd.24
> > 2020-12-10T21:40:16.112299+0100 mon.pve1-prg2a (mon.0) 125946 : cluster [WRN] Health check failed: 0 slow ops, oldest one blocked for 32 sec, osd.8 has slow ops (SLOW_OPS)
> > 2020-12-10T21:40:16.202867+0100 mon.pve1-prg2a (mon.0) 125947 : cluster [DBG] osd.4 reported failed by osd.34
> > 2020-12-10T21:40:16.202986+0100 mon.pve1-prg2a (mon.0) 125948 : cluster [INF] osd.4 failed (root=default,host=pve1-prg2a) (2 reporters from different host after 24.000267 >= grace 22.361677)
> > 2020-12-10T21:40:16.373925+0100 mon.pve1-prg2a (mon.0) 125949 : cluster [DBG] osd.39 reported failed by osd.6
> > 2020-12-10T21:40:16.865608+0100 mon.pve1-prg2a (mon.0) 125951 : cluster [DBG] osd.39 reported failed by osd.8
> > 2020-12-10T21:40:17.125917+0100 mon.pve1-prg2a (mon.0) 125952 : cluster [WRN] Health check failed: 1 osds down (OSD_DOWN)
> > 2020-12-10T21:40:17.139006+0100 mon.pve1-prg2a (mon.0) 125953 : cluster [DBG] osdmap e12819: 40 total, 39 up, 30 in
> > 2020-12-10T21:40:17.140248+0100 mon.pve1-prg2a (mon.0) 125954 : cluster [DBG] osd.39 reported failed by osd.21
> > 2020-12-10T21:40:17.344244+0100 mgr.pve2-prg2a (mgr.91575377) 118757 : cluster [DBG] pgmap v119012: 1737 pgs: 9 peering, 61 stale+active+clean, 1 active+clean+scrubbing+deep, 7 active+clean+laggy, 1659 active+clean; 3.3 TiB data, 7.8 TiB used, 14 TiB / 22 TiB avail; 44 KiB/s rd, 2.5 MiB/s wr, 107 op/s
> > 2020-12-10T21:40:17.378069+0100 mon.pve1-prg2a (mon.0) 125955 : cluster [DBG] osd.39 reported failed by osd.26
> > 2020-12-10T21:40:17.424429+0100 mon.pve1-prg2a (mon.0) 125956 : cluster [DBG] osd.39 reported failed by osd.18
> > 2020-12-10T21:40:17.829447+0100 mon.pve1-prg2a (mon.0) 125957 : cluster [DBG] osd.39 reported failed by osd.36
> > 2020-12-10T21:40:17.847373+0100 mon.pve1-prg2a (mon.0) 125958 : cluster [DBG] osd.39 reported failed by osd.1
> > 2020-12-10T21:40:17.858371+0100 mon.pve1-prg2a (mon.0) 125959 : cluster [DBG] osd.39 reported failed by osd.17
> > 2020-12-10T21:40:17.915755+0100 mon.pve1-prg2a (mon.0) 125960 : cluster [DBG] osd.39 reported failed by osd.28
> >
> > 2020-12-10T21:40:24.151192+0100 mon.pve1-prg2a (mon.0) 125986 : cluster [INF] Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg peering)
> > 2020-12-10T21:40:24.608038+0100 mon.pve1-prg2a (mon.0) 125987 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 37 sec, osd.8 has slow ops (SLOW_OPS)
> > 2020-12-10T21:40:25.375322+0100 mgr.pve2-prg2a (mgr.91575377) 118761 : cluster [DBG] pgmap v119017: 1737 pgs: 1 active+clean+scrubbing+deep, 6 active+undersized, 40 active+undersized+wait, 13 active+undersized+degraded, 101 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 4.0 MiB/s wr, 179 op/s; 74043/2676420 objects degraded (2.766%); 622 KiB/s, 0 objects/s recovering
> > 2020-12-10T21:40:27.379311+0100 mgr.pve2-prg2a (mgr.91575377) 118762 : cluster [DBG] pgmap v119018: 1737 pgs: 1 active+clean+scrubbing+deep, 6 active+undersized, 40 active+undersized+wait, 13 active+undersized+degraded, 101 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 3.6 MiB/s wr, 154 op/s; 74043/2676420 objects degraded (2.766%); 510 KiB/s, 0 objects/s recovering
> > 2020-12-10T21:40:29.391905+0100 mgr.pve2-prg2a (mgr.91575377) 118763 : cluster [DBG] pgmap v119019: 1737 pgs: 1 active+clean+scrubbing+deep, 11 active+undersized, 35 active+undersized+wait, 21 active+undersized+degraded, 93 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 3.9 MiB/s wr, 170 op/s; 74043/2676420 objects degraded (2.766%); 456 KiB/s, 0 objects/s recovering
> > 2020-12-10T21:40:29.609801+0100 mon.pve1-prg2a (mon.0) 125991 : cluster [WRN] Health check update: Degraded data redundancy: 74043/2676420 objects degraded (2.766%), 114 pgs degraded (PG_DEGRADED)
> > 2020-12-10T21:40:29.609845+0100 mon.pve1-prg2a (mon.0) 125992 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 42 sec, osd.8 has slow ops (SLOW_OPS)
> > 2020-12-10T21:40:31.395715+0100 mgr.pve2-prg2a (mgr.91575377) 118764 : cluster [DBG] pgmap v119020: 1737 pgs: 1 active+clean+scrubbing+deep, 11 active+undersized, 35 active+undersized+wait, 21 active+undersized+degraded, 93 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 154 KiB/s rd, 2.1 MiB/s wr, 151 op/s; 74043/2676420 objects degraded (2.766%); 425 KiB/s, 0 objects/s recovering
> > 2020-12-10T21:40:32.797773+0100 osd.8 (osd.8) 19 : cluster [WRN] slow request osd_op(client.91721949.0:169935 1.3b 1:dc12cc94:::rbd_header.0373d6a5f7de48:head [watch ping cookie 139741691437056] snapc 0=[] ondisk+write+known_if_redirected e12816) initiated 2020-12-10T21:39:59.813054+0100 currently delayed
> > 2020-12-10T21:40:33.399277+0100 mgr.pve2-prg2a (mgr.91575377) 118765 : cluster [DBG] pgmap v119021: 1737 pgs: 1 active+clean+scrubbing+deep, 13 active+undersized, 33 active+undersized+wait, 28 active+undersized+degraded, 86 active+undersized+degraded+wait, 3 stale+active+clean, 2 active+clean+laggy, 1571 active+clean; 3.3 TiB data, 7.8 TiB used, 21 TiB / 29 TiB avail; 156 KiB/s rd, 1.6 MiB/s wr, 144 op/s; 74043/2676420 objects degraded (2.766%)
> > 2020-12-10T21:40:33.841597+0100 osd.8 (osd.8) 20 : cluster [WRN] slow request osd_op(client.91721949.0:169935 1.3b 1:dc12cc94:::rbd_header.0373d6a5f7de48:head [watch ping cookie 139741691437056] snapc 0=[] ondisk+write+known_if_redirected e12816) initiated 2020-12-10T21:39:59.813054+0100 currently delayed
> > 2020-12-10T21:40:34.611650+0100 mon.pve1-prg2a (mon.0) 125996 : cluster [WRN] Health check update: 0 slow ops, oldest one blocked for 47 sec, daemons [osd.22,osd.8] have slow ops. (SLOW_OPS)
> > 2020-12-10T21:40:34.612166+0100 mon.pve1-prg2a (mon.0) 125997 : cluster [INF] osd.39 failed (root=default,host=pve5-prg2a) (5 reporters from different host after 39.438374 >= grace 37.721344)
> > 2020-12-10T21:40:34.615286+0100 mon.pve1-prg2a (mon.0) 125998 : cluster [WRN] Health check update: 2 osds down (OSD_DOWN)
> > 2020-12-10T21:40:34.621801+0100 mon.pve1-prg2a (mon.0) 125999 : cluster [DBG] osdmap e12821: 40 total, 38 up, 30 in
> > 2020-12-10T21:40:34.880626+0100 osd.8 (osd.8) 21 : cluster [WRN] slow request osd_op(client.91721949.0:169935 1.3b 1:dc12cc94:::rbd_header.0373d6a5f7de48:head [watch ping cookie 139741691437056] snapc 0=[] ondisk+write+known_if_redirected e12816) initiated 2020-12-10T21:39:59.813054+0100 currently delayed
> >
> > 8/0 sis=12819) [28,35] r=-1 lpr=12819 pi=[12507,12819)/1 crt=12818'17226265 lcod 12818'17226264 mlcod 0'0 unknown NOTIFY mbc={}] exit Start 0.000020 0 0.000000
> > -823> 2020-12-10T21:40:44.079+0100 7f5310751700 5 osd.4 pg_epoch: 12822 pg[1.cc( v 12818'17226265 (12812'17224439,12818'17226265] local-lis/les=12507/12508 n=1566 ec=171/171 lis/c=12507/12507 les/c/f=12508/12508/0 sis=12819) [28,35] r=-1 lpr=12819 pi=[12507,12819)/1 crt=12818'17226265 lcod 12818'17226264 mlcod 0'0 unknown NOTIFY mbc={}] enter Started/Stray
> > -822> 2020-12-10T21:40:44.191+0100 7f5324f7a700 5 prioritycache tune_memory target: 4294967296 mapped: 3494428672 unmapped: 383787008 heap: 3878215680 old mem: 2845415832 new mem: 2845415832
> > -821> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had timed out after 15
> > -820> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5312f56700' had timed out after 15
> > -819> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 osd.4 12822 is_healthy false -- internal heartbeat failed
> > -818> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 osd.4 12822 not healthy; waiting to boot
> > -817> 2020-12-10T21:40:44.311+0100 7f532ca01700 1 osd.4 12822 tick checking mon for new map
> > -816> 2020-12-10T21:40:44.311+0100 7f530e74d700 5 osd.4 12822 heartbeat osd_stat(store_statfs(0x7c5f5b0000/0x2e36c0000/0xba40000000, data 0x3af1c2f453/0x3afd380000, compress 0x0/0x0/0x0, omap 0x30256c, meta 0x2e33bda94), peers [3,5,6,7,8,9,13,20,28,36] op hist [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,22])
> > -815> 2020-12-10T21:40:44.403+0100 7f5323337700 10 monclient: tick
> > -814> 2020-12-10T21:40:44.403+0100 7f5323337700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-12-10T21:40:14.409098+0100)
> > -813> 2020-12-10T21:40:44.403+0100 7f5323337700 10 log_client log_queue is 2 last_log 7 sent 5 num 2 unsent 2 sending 2
> > -812> 2020-12-10T21:40:44.403+0100 7f5323337700 10 log_client will send 2020-12-10T21:40:44.063147+0100 osd.4 (osd.4) 6 : cluster [WRN] Monitor daemon marked osd.4 down, but it is still running
> > -811> 2020-12-10T21:40:44.403+0100 7f5323337700 10 log_client will send 2020-12-10T21:40:44.063157+0100 osd.4 (osd.4) 7 : cluster [DBG] map e12822 wrongly marked me down at e12819
> > -810> 2020-12-10T21:40:44.403+0100 7f5323337700 10 monclient: _send_mon_message to mon.pve1-prg2a at v2:10.104.200.11:3300/0
> > -809> 2020-12-10T21:40:45.191+0100 7f5324f7a700 5 prioritycache tune_memory target: 4294967296 mapped: 3494428672 unmapped: 383787008 heap: 3878215680 old mem: 2845415832 new mem: 2845415832
> > -808> 2020-12-10T21:40:45.351+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had timed out after 15
> > -807> 2020-12-10T21:40:45.351+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f5312f56700' had timed out after 15
> >
> > -3> 2020-12-10T21:42:10.466+0100 7f5324f7a700 5 prioritycache tune_memory target: 4294967296 mapped: 3510689792 unmapped: 367525888 heap: 3878215680 old mem: 2845415832 new mem: 2845415832
> > -2> 2020-12-10T21:42:10.694+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had timed out after 15
> > -1> 2020-12-10T21:42:10.694+0100 7f532ca01700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f530ef4e700' had suicide timed out after 150
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx