You are right again. Thank you.

---

However, is this really the right behaviour (for all IO to stop), given that the cluster has enough capacity to rebalance? Why doesn't the rebalance algorithm prevent a single OSD from becoming "too full"?

Rok

On Sun, Dec 22, 2024 at 12:00 AM Eugen Block <eblock@xxxxxx> wrote:

> The full OSD is most likely the reason. You can temporarily increase
> the threshold to 0.97 or so, but you need to prevent that from happening.
> The cluster usually starts warning you at 85%.
>
> Zitat von Rok Jaklič <rjaklic@xxxxxxxxx>:
>
> > Hi,
> >
> > for some reason radosgw stopped working.
> >
> > Cluster status:
> > [root@ctplmon1 ~]# ceph -v
> > ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
> > [root@ctplmon1 ~]# ceph -s
> >   cluster:
> >     id:     0a6e5422-ac75-4093-af20-528ee00cc847
> >     health: HEALTH_ERR
> >             6 OSD(s) experiencing slow operations in BlueStore
> >             2 backfillfull osd(s)
> >             1 full osd(s)
> >             1 nearfull osd(s)
> >             Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull
> >             Degraded data redundancy: 835306/1285383707 objects degraded (0.065%), 6 pgs degraded, 5 pgs undersized
> >             76 pgs not deep-scrubbed in time
> >             45 pgs not scrubbed in time
> >             Full OSDs blocking recovery: 1 pg recovery_toofull
> >             9 pool(s) full
> >             9 daemons have recently crashed
> >
> >   services:
> >     mon: 3 daemons, quorum ctplmon1,ctplmon3,ctplmon2 (age 36m)
> >     mgr: ctplmon1(active, since 65m)
> >     mds: 1/1 daemons up
> >     osd: 193 osds: 191 up (since 8m), 191 in (since 9m); 267 remapped pgs
> >     rgw: 2 daemons active (1 hosts, 1 zones)
> >
> >   data:
> >     volumes: 1/1 healthy
> >     pools:   10 pools, 793 pgs
> >     objects: 257.08M objects, 292 TiB
> >     usage:   614 TiB used, 386 TiB / 1000 TiB avail
> >     pgs:     835306/1285383707 objects degraded (0.065%)
> >              225512620/1285383707 objects misplaced (17.544%)
> >              525 active+clean
> >              230 active+remapped+backfilling
> >              32  active+remapped+backfill_toofull
> >              5   active+undersized+degraded+remapped+backfilling
> >              1   active+recovery_toofull+degraded
> >
> >   io:
> >     recovery: 978 MiB/s, 825 objects/s
> >
> > ---
> >
> > I don't know if it is related, but the cluster has been rebalancing for a
> > few days now, after I set the EC pool to use only hdd.
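For reference, the temporary workaround Eugen suggests above would look roughly
like this (a sketch only; the values are examples, assuming the default ratios
are still in place, and they should be lowered again once backfill has freed
up space):

  # check the current ratios and per-OSD utilization
  ceph osd dump | grep ratio
  ceph osd df tree

  # temporarily raise the thresholds so recovery and RGW can proceed
  ceph osd set-backfillfull-ratio 0.95
  ceph osd set-full-ratio 0.97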
> >
> > ---
> >
> > If I start rgw with debug I get something like this in logs:
> >
> > [root@ctplmon2 ~]# radosgw -c /etc/ceph/ceph.conf --setuser ceph --setgroup ceph -n client.radosgw.moja.shramba.ctplmon2 -f -m 194.249.4.104:6789 --debug-rgw=99/99
> > 2024-12-21T23:21:59.898+0100 7f659e380640 -1 Initialization timeout, failed to initialize
> >
> > In logs I get:
> >
> > 2024-12-21T23:16:59.898+0100 7f65a19257c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
> > 2024-12-21T23:16:59.898+0100 7f65a19257c0 0 ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable), process radosgw, pid 168935
> > 2024-12-21T23:16:59.898+0100 7f65a19257c0 0 framework: beast
> > 2024-12-21T23:16:59.898+0100 7f65a19257c0 0 framework conf key: port, val: 4444
> > 2024-12-21T23:16:59.898+0100 7f65a19257c0 1 radosgw_Main not setting numa affinity
> > 2024-12-21T23:16:59.901+0100 7f65a19257c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> > 2024-12-21T23:16:59.901+0100 7f65a19257c0 1 D3N datacache enabled: 0
> > 2024-12-21T23:16:59.901+0100 7f658dffb640 20 reqs_thread_entry: start
> > 2024-12-21T23:16:59.901+0100 7f658d7fa640 10 entry start
> > 2024-12-21T23:16:59.908+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: realm
> > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:16:59.915+0100 7f65a19257c0 4 rgw main: RGWPeriod::init failed to init realm id : (2) No such file or directory
> > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:16:59.917+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > 2024-12-21T23:16:59.917+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:16:59.945+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=873
> > 2024-12-21T23:16:59.945+0100 7f65a19257c0 20 rgw main: searching for the correct realm
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_info.c2c70444-7a41-4acd-a0d0-9f87d324ec72
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroup_info.b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_names.default
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroups_names.default
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.211+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > 2024-12-21T23:17:00.211+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.212+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=358
> > 2024-12-21T23:17:00.212+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.213+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.213+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.214+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.214+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.215+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_info.c2c70444-7a41-4acd-a0d0-9f87d324ec72
> > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroup_info.b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578
> > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_names.default
> > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroups_names.default
> > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.285+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > 2024-12-21T23:17:00.285+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.286+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=873
> > 2024-12-21T23:17:00.286+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.287+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > 2024-12-21T23:17:00.287+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.293+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=358
> > 2024-12-21T23:17:00.293+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 zone default found
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 4 rgw main: Realm: ()
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 4 rgw main: ZoneGroup: default (b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578)
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 4 rgw main: Zone: default (c2c70444-7a41-4acd-a0d0-9f87d324ec72)
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 10 cannot find current period zonegroup using local zonegroup configuration
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: zonegroup default
> > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.296+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.296+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.299+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.299+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: started sync module instance, tier type =
> > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: started zone id=c2c70444-7a41-4acd-a0d0-9f87d324ec72 (name=default) with tier type =
> > 2024-12-21T23:21:59.898+0100 7f659e380640 -1 Initialization timeout, failed to initialize
> >
> > ---
> >
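As a side note: the five-minute gap between the last debug message (23:17:00)
and the "Initialization timeout" line (23:21:59) matches the default
rgw_init_timeout of 300 seconds, which suggests the daemon gives up while its
startup I/O against the (full) cluster is still blocked. Purely as a debugging
aid, not a fix for the underlying full OSD, the timeout could be raised when
starting radosgw by hand, for example (value is only an example):

  radosgw -c /etc/ceph/ceph.conf -n client.radosgw.moja.shramba.ctplmon2 \
      -f --debug-rgw=99/99 --rgw-init-timeout=1200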
> > Any ideas what might cause rgw to stop working?
> >
> > Kind regards,
> > Rok

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx