Hi Rok,

All great suggestions here. Try moving around some PGs with upmap. In case
you need something very basic and simple, we use this:
https://github.com/laimis9133/plankton-swarm
I'll add a target OSD option this evening.

Best,
Laimis J.

On Sun, Dec 22, 2024, 10:49 Alwin Antreich <alwin.antreich@xxxxxxxx> wrote:

> Hi Rok,
>
> The full_osd state is when IO is stopped, to protect the data from
> corruption. The 95% limit can be overshot by 1-2%, so be careful when you
> increase the full_osd limit.
>
> The backfillfull limit kicks in at 90%: backfill onto the OSD is halted,
> but PGs still move off the OSD, and these warnings usually resolve
> themselves during the data movement. If they don't afterwards, then of
> course you need to look.
>
> There are a couple of circumstances that might fill up an OSD: poor
> balance, or a low number of PGs (off the top of my head).
>
> You could reweight the OSDs in question to tell Ceph to move some PGs
> off, but that leaves the choice of which PGs move to the algorithm.
>
> You could also use pgremapper to manually reassign PGs to different
> OSDs, which gives you more control over PG movement. It works by setting
> upmaps; the balancer needs to be off, and the Ceph version needs to be
> newer than Luminous throughout.
> https://github.com/digitalocean/pgremapper
>
> I hope this helps.
>
> Cheers,
> Alwin Antreich
> croit GmbH, https://croit.io/
>
> On Sun, Dec 22, 2024, 00:07 Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>
> > You are right again.
> >
> > Thank you.
> >
> > ---
> >
> > However, is this really the right way (for all IO to stop), given that
> > the cluster has enough capacity to rebalance?
> >
> > Why doesn't the rebalance algorithm prevent one OSD from getting "too
> > full"?
> >
> > Rok
> >
> > On Sun, Dec 22, 2024 at 12:00 AM Eugen Block <eblock@xxxxxx> wrote:
> >
> > > The full OSD is most likely the reason. You can temporarily increase
> > > the threshold to 0.97 or so, but you need to prevent that from
> > > happening. The cluster usually starts warning you at 85%.
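[The knobs discussed in this thread map to a handful of standard Ceph commands. A rough sketch; osd.42, osd.57 and PG 11.2f are placeholders, look up the real ids with `ceph osd df` and `ceph pg ls-by-osd`:]

```shell
# Inspect fill levels and see which PGs sit on the full OSD.
ceph osd df tree
ceph pg ls-by-osd osd.42

# Stop-gap: temporarily raise the full threshold so IO resumes.
# The 95% default exists to protect the data, so revert it soon.
ceph osd set-full-ratio 0.97

# Reweight the full OSD so Ceph moves some PGs off it
# (which PGs move is left to the algorithm).
ceph osd reweight 42 0.90

# Or pin a specific PG to a different OSD via upmap; the balancer
# must be off, and all daemons must be newer than Luminous.
ceph balancer off
ceph osd pg-upmap-items 11.2f 42 57   # remap PG 11.2f from osd.42 to osd.57

# Once the cluster has rebalanced, undo the stop-gaps.
ceph osd set-full-ratio 0.95
ceph osd reweight 42 1.0
```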
> > >
> > > Zitat von Rok Jaklič <rjaklic@xxxxxxxxx>:
> > >
> > > > Hi,
> > > >
> > > > for some reason radosgw stopped working.
> > > >
> > > > Cluster status:
> > > > [root@ctplmon1 ~]# ceph -v
> > > > ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable)
> > > > [root@ctplmon1 ~]# ceph -s
> > > >   cluster:
> > > >     id:     0a6e5422-ac75-4093-af20-528ee00cc847
> > > >     health: HEALTH_ERR
> > > >             6 OSD(s) experiencing slow operations in BlueStore
> > > >             2 backfillfull osd(s)
> > > >             1 full osd(s)
> > > >             1 nearfull osd(s)
> > > >             Low space hindering backfill (add storage if this doesn't resolve itself): 32 pgs backfill_toofull
> > > >             Degraded data redundancy: 835306/1285383707 objects degraded (0.065%), 6 pgs degraded, 5 pgs undersized
> > > >             76 pgs not deep-scrubbed in time
> > > >             45 pgs not scrubbed in time
> > > >             Full OSDs blocking recovery: 1 pg recovery_toofull
> > > >             9 pool(s) full
> > > >             9 daemons have recently crashed
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum ctplmon1,ctplmon3,ctplmon2 (age 36m)
> > > >     mgr: ctplmon1(active, since 65m)
> > > >     mds: 1/1 daemons up
> > > >     osd: 193 osds: 191 up (since 8m), 191 in (since 9m); 267 remapped pgs
> > > >     rgw: 2 daemons active (1 hosts, 1 zones)
> > > >
> > > >   data:
> > > >     volumes: 1/1 healthy
> > > >     pools:   10 pools, 793 pgs
> > > >     objects: 257.08M objects, 292 TiB
> > > >     usage:   614 TiB used, 386 TiB / 1000 TiB avail
> > > >     pgs:     835306/1285383707 objects degraded (0.065%)
> > > >              225512620/1285383707 objects misplaced (17.544%)
> > > >              525 active+clean
> > > >              230 active+remapped+backfilling
> > > >              32  active+remapped+backfill_toofull
> > > >              5   active+undersized+degraded+remapped+backfilling
> > > >              1   active+recovery_toofull+degraded
> > > >
> > > >   io:
> > > >     recovery: 978 MiB/s, 825 objects/s
> > > >
> > > > ---
> > > >
> > > > Do not know if it is related, but the cluster has been rebalancing for a
> > > > few days now, after I've set the EC pool to use only hdd.
> > > >
> > > > ---
> > > >
> > > > If I start rgw with debug I get something like this in logs:
> > > > [root@ctplmon2 ~]# radosgw -c /etc/ceph/ceph.conf --setuser ceph \
> > > >   --setgroup ceph -n client.radosgw.moja.shramba.ctplmon2 -f \
> > > >   -m 194.249.4.104:6789 --debug-rgw=99/99
> > > > 2024-12-21T23:21:59.898+0100 7f659e380640 -1 Initialization timeout, failed to initialize
> > > >
> > > > In logs I get:
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable), process radosgw, pid 168935
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 framework: beast
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 framework conf key: port, val: 4444
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  1 radosgw_Main not setting numa affinity
> > > > 2024-12-21T23:16:59.901+0100 7f65a19257c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
> > > > 2024-12-21T23:16:59.901+0100 7f65a19257c0  1 D3N datacache enabled: 0
> > > > 2024-12-21T23:16:59.901+0100 7f658dffb640 20 reqs_thread_entry: start
> > > > 2024-12-21T23:16:59.901+0100 7f658d7fa640 10 entry start
> > > > 2024-12-21T23:16:59.908+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: realm
> > > > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0  4 rgw main: RGWPeriod::init failed to init realm id : (2) No such file or directory
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:16:59.917+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > > > 2024-12-21T23:16:59.917+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:16:59.945+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=873
> > > > 2024-12-21T23:16:59.945+0100 7f65a19257c0 20 rgw main: searching for the correct realm
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_info.c2c70444-7a41-4acd-a0d0-9f87d324ec72
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroup_info.b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_names.default
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroups_names.default
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.211+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > > > 2024-12-21T23:17:00.211+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.212+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=358
> > > > 2024-12-21T23:17:00.212+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.213+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.213+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.214+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.214+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.215+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_info.c2c70444-7a41-4acd-a0d0-9f87d324ec72
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroup_info.b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zone_names.default
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: RGWRados::pool_iterate: got zonegroups_names.default
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.285+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > > > 2024-12-21T23:17:00.285+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.286+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=873
> > > > 2024-12-21T23:17:00.286+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.287+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=46
> > > > 2024-12-21T23:17:00.287+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.293+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=0 bl.length=358
> > > > 2024-12-21T23:17:00.293+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 zone default found
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0  4 rgw main: Realm:     ()
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0  4 rgw main: ZoneGroup: default (b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578)
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0  4 rgw main: Zone:      default (c2c70444-7a41-4acd-a0d0-9f87d324ec72)
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 10 cannot find current period zonegroup using local zonegroup configuration
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: zonegroup default
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.296+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.296+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.299+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.299+0100 7f65a19257c0 20 rgw main: rados->read ofs=0 len=0
> > > > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: rados_obj.operate() r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: started sync module instance, tier type =
> > > > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: started zone id=c2c70444-7a41-4acd-a0d0-9f87d324ec72 (name=default) with tier type =
> > > > 2024-12-21T23:21:59.898+0100 7f659e380640 -1 Initialization timeout, failed to initialize
> > > >
> > > > ---
> > > >
> > > > Any ideas what might cause rgw to stop working?
> > > >
> > > > Kind regards,
> > > > Rok
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx