Re: radosgw stopped working

Hi Rok,

All great suggestions here; try moving some PGs around with upmap.
In case you need something very basic and simple, we use this:
https://github.com/laimis9133/plankton-swarm
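
For reference, moving a single PG off a full OSD by hand with upmap looks
roughly like this (the PG id and OSD numbers are placeholders, not values
from your cluster):

  ceph pg ls backfill_toofull              # list the PGs that are currently stuck
  ceph osd pg-upmap-items 11.2f 188 42     # in PG 11.2f, replace osd.188 with osd.42
  ceph osd rm-pg-upmap-items 11.2f         # drop the mapping again once things settle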

I'll add a target osd option this evening.


Best,
Laimis J.


On Sun, Dec 22, 2024, 10:49 Alwin Antreich <alwin.antreich@xxxxxxxx> wrote:

> Hi Rok,
>
> The full_osd state is when IO is stopped, to protect the data from
> corruption. The 95% limit can be overshot by 1-2%, so be careful when you
> increase the full_osd limit.
>
> The nearfull_osd limit warns early (85% by default), and at the backfillfull
> limit (90% by default) backfill onto the OSD is halted. But PGs still move
> off the OSD, and usually these warnings resolve themselves during the data
> move. If they don't once the rebalance finishes, then of course you need to
> look.
>
> There are a couple of circumstances that might fill up an OSD: poor
> balance or a low number of PGs, off the top of my head.
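>
> A quick way to check both (plain Ceph CLI, nothing cluster-specific assumed):
>
>   ceph osd df tree     # per-OSD %USE and PG count, to spot imbalance
>   ceph health detail   # names the full / backfillfull / nearfull OSDs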
>
>
> You could reweight the OSDs in question to tell Ceph to move some PGs off,
> but that leaves the choice of which PGs move to the algorithm.
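>
> A minimal sketch (the OSD id and weight here are placeholders):
>
>   ceph osd reweight 188 0.90   # lower the reweight so some PGs move elsewhere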
>
> You could also use pgremapper to manually reassign PGs to different
> OSDs. This gives you more control over PG movement. It works by setting
> upmaps, so the balancer needs to be off and everything in the cluster needs
> to be newer than Luminous.
> https://github.com/digitalocean/pgremapper
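>
> The prerequisites it relies on can be set like this (only needed if the
> balancer is still on or the minimum client compat was never raised):
>
>   ceph balancer off
>   ceph osd set-require-min-compat-client luminous   # upmap needs luminous or newer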
>
> I hope this helps.
>
> Cheers,
> Alwin Antreich
> croit GmbH, https://croit.io/
>
> On Sun, Dec 22, 2024, 00:07 Rok Jaklič <rjaklic@xxxxxxxxx> wrote:
>
> > You are right again.
> >
> > Thank you.
> >
> > ---
> >
> > However, is this really the right behaviour (for all IO to stop), since
> > the cluster has enough capacity to rebalance?
> >
> > Why doesn't the rebalance algorithm prevent one OSD from becoming "too full"?
> >
> > Rok
> >
> > On Sun, Dec 22, 2024 at 12:00 AM Eugen Block <eblock@xxxxxx> wrote:
> >
> > > The full OSD is most likely the reason. You can temporarily increase
> > > the threshold to 0.97 or so, but you need to prevent that from happening
> > > again. The cluster usually starts warning you at 85%.
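> > >
> > > A sketch of the temporary bump (standard Ceph commands; revert once
> > > backfill has drained the OSD):
> > >
> > >   ceph osd set-full-ratio 0.97          # default is 0.95
> > >   ceph osd set-backfillfull-ratio 0.92  # optional, default is 0.90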
> > >
> > > Quoting Rok Jaklič <rjaklic@xxxxxxxxx>:
> > >
> > > > Hi,
> > > >
> > > > for some reason radosgw stopped working.
> > > >
> > > > Cluster status:
> > > > [root@ctplmon1 ~]# ceph -v
> > > > ceph version 17.2.8 (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy
> > > > (stable)
> > > > [root@ctplmon1 ~]# ceph -s
> > > >   cluster:
> > > >     id:     0a6e5422-ac75-4093-af20-528ee00cc847
> > > >     health: HEALTH_ERR
> > > >             6 OSD(s) experiencing slow operations in BlueStore
> > > >             2 backfillfull osd(s)
> > > >             1 full osd(s)
> > > >             1 nearfull osd(s)
> > > >             Low space hindering backfill (add storage if this doesn't
> > > > resolve itself): 32 pgs backfill_toofull
> > > >             Degraded data redundancy: 835306/1285383707 objects
> > degraded
> > > > (0.065%), 6 pgs degraded, 5 pgs undersized
> > > >             76 pgs not deep-scrubbed in time
> > > >             45 pgs not scrubbed in time
> > > >             Full OSDs blocking recovery: 1 pg recovery_toofull
> > > >             9 pool(s) full
> > > >             9 daemons have recently crashed
> > > >
> > > >   services:
> > > >     mon: 3 daemons, quorum ctplmon1,ctplmon3,ctplmon2 (age 36m)
> > > >     mgr: ctplmon1(active, since 65m)
> > > >     mds: 1/1 daemons up
> > > >     osd: 193 osds: 191 up (since 8m), 191 in (since 9m); 267 remapped
> > pgs
> > > >     rgw: 2 daemons active (1 hosts, 1 zones)
> > > >
> > > >   data:
> > > >     volumes: 1/1 healthy
> > > >     pools:   10 pools, 793 pgs
> > > >     objects: 257.08M objects, 292 TiB
> > > >     usage:   614 TiB used, 386 TiB / 1000 TiB avail
> > > >     pgs:     835306/1285383707 objects degraded (0.065%)
> > > >              225512620/1285383707 objects misplaced (17.544%)
> > > >              525 active+clean
> > > >              230 active+remapped+backfilling
> > > >              32  active+remapped+backfill_toofull
> > > >              5   active+undersized+degraded+remapped+backfilling
> > > >              1   active+recovery_toofull+degraded
> > > >
> > > >   io:
> > > >     recovery: 978 MiB/s, 825 objects/s
> > > >
> > > > ---
> > > >
> > > > I do not know if it is related, but the cluster has been rebalancing
> > > > for a few days now, after I set the EC pool to only use hdd.
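> > > >
> > > > For context, that change was made roughly like this (the profile, rule
> > > > and pool names here are placeholders, not the exact ones I used):
> > > >
> > > >   ceph osd erasure-code-profile set ec-hdd k=4 m=2 crush-device-class=hdd
> > > >   ceph osd crush rule create-erasure ec-hdd-rule ec-hdd
> > > >   ceph osd pool set mypool crush_rule ec-hdd-rule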
> > > >
> > > > ---
> > > >
> > > > If I start rgw with debug, I get something like this in the logs:
> > > > [root@ctplmon2 ~]# radosgw -c /etc/ceph/ceph.conf --setuser ceph
> > > --setgroup
> > > > ceph -n client.radosgw.moja.shramba.ctplmon2 -f -m
> 194.249.4.104:6789
> > > > --debug-rgw=99/99
> > > > 2024-12-21T23:21:59.898+0100 7f659e380640 -1 Initialization timeout,
> > > failed
> > > > to initialize
> > > >
> > > > In logs I get:
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 deferred set uid:gid to
> > > > 167:167 (ceph:ceph)
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 ceph version 17.2.8
> > > > (f817ceb7f187defb1d021d6328fa833eb8e943b3) quincy (stable), process
> > > > radosgw, pid 168935
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 framework: beast
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  0 framework conf key:
> port,
> > > val:
> > > > 4444
> > > > 2024-12-21T23:16:59.898+0100 7f65a19257c0  1 radosgw_Main not setting
> > > numa
> > > > affinity
> > > > 2024-12-21T23:16:59.901+0100 7f65a19257c0  1 rgw_d3n:
> > > > rgw_d3n_l1_local_datacache_enabled=0
> > > > 2024-12-21T23:16:59.901+0100 7f65a19257c0  1 D3N datacache enabled: 0
> > > > 2024-12-21T23:16:59.901+0100 7f658dffb640 20 reqs_thread_entry: start
> > > > 2024-12-21T23:16:59.901+0100 7f658d7fa640 10 entry start
> > > > 2024-12-21T23:16:59.908+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: realm
> > > > 2024-12-21T23:16:59.914+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0  4 rgw main:
> RGWPeriod::init
> > > > failed to init realm  id  : (2) No such file or directory
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:16:59.915+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:16:59.917+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=46
> > > > 2024-12-21T23:16:59.917+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:16:59.945+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=873
> > > > 2024-12-21T23:16:59.945+0100 7f65a19257c0 20 rgw main: searching for
> > the
> > > > correct realm
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got
> > > zone_info.c2c70444-7a41-4acd-a0d0-9f87d324ec72
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got
> > > > zonegroup_info.b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got zone_names.default
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got zonegroups_names.default
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.210+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.211+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=46
> > > > 2024-12-21T23:17:00.211+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.212+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=358
> > > > 2024-12-21T23:17:00.212+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.213+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.213+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.214+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.214+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.215+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=46
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got
> > > zone_info.c2c70444-7a41-4acd-a0d0-9f87d324ec72
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got
> > > > zonegroup_info.b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got zone_names.default
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main:
> > > > RGWRados::pool_iterate: got zonegroups_names.default
> > > > 2024-12-21T23:17:00.284+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.285+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=46
> > > > 2024-12-21T23:17:00.285+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.286+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=873
> > > > 2024-12-21T23:17:00.286+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.287+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=46
> > > > 2024-12-21T23:17:00.287+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.293+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=0 bl.length=358
> > > > 2024-12-21T23:17:00.293+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 zone default found
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0  4 rgw main: Realm:
> > > >            ()
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0  4 rgw main: ZoneGroup:
> > default
> > > >            (b1e0d55c-f7cb-4e73-b1cb-6cffa1fd6578)
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0  4 rgw main: Zone:
> > default
> > > >            (c2c70444-7a41-4acd-a0d0-9f87d324ec72)
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 10 cannot find current
> period
> > > > zonegroup using local zonegroup configuration
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: zonegroup
> > default
> > > > 2024-12-21T23:17:00.295+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.296+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.296+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.299+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.299+0100 7f65a19257c0 20 rgw main: rados->read
> > ofs=0
> > > > len=0
> > > > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main:
> > > rados_obj.operate()
> > > > r=-2 bl.length=0
> > > > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: started sync
> > > module
> > > > instance, tier type =
> > > > 2024-12-21T23:17:00.303+0100 7f65a19257c0 20 rgw main: started zone
> > > > id=c2c70444-7a41-4acd-a0d0-9f87d324ec72 (name=default) with tier
> type =
> > > > 2024-12-21T23:21:59.898+0100 7f659e380640 -1 Initialization timeout,
> > > failed
> > > > to initialize
> > > >
> > > > ---
> > > >
> > > > Any ideas what might cause rgw to stop working?
> > > >
> > > > Kind regards,
> > > > Rok
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



