After an upgrade from 15.2.4 to 15.2.13, this small home lab cluster ran into issues with OSDs failing on all four hosts. This might be unrelated to the upgrade, but the trigger appears to have been an autoscaling event in which the RBD pool was scaled from 128 PGs to 512 PGs. Only some OSDs are affected, and during OSD startup the following output can be observed:

2021-07-08T03:57:55.496+0200 7fc7303ff700 10 osd.17 146136 split_pgs splitting pg[5.25( v 146017'38948152 (146011'38947652,146017'38948152] local-lis/les=146012/146013 n=1168 ec=2338/46 lis/c=146012/145792 les/c/f=146013/145793/36878 sis=146019) [17,6] r=0 lpr=146019 pi=[145792,146019)/1 crt=146017'38948152 lcod 0'0 mlcod 0'0 unknown mbc={}] into 5.a5

Exporting/removing the PGs belonging to the pool in question seems to resolve the OOM kills, but (naturally) at the cost of data loss. There isn't a lot of activity in the log (20/20 logging), but everything seems to revolve around splitting PGs. Full OSD startup log attached.

At this point I've exported all the troublesome PGs and gotten all the OSDs online (the commands I used are sketched at the end of this mail). Attempting to start one of the troubled OSDs with a troubled PG still present results in all memory (80 GiB) being exhausted before the OOM killer steps in.

Looking at a dump of the mempools, buffer_anon looks severely high. Memory leak?

Any guidance on how to further troubleshoot this issue would be greatly appreciated.
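P.S. For reference, this is roughly how I exported and removed an affected PG, written down from memory, so treat it as a sketch rather than an exact transcript. pg 5.25 and osd.17 are just the example from the log above, and the data path assumes the default non-containerized /var/lib/ceph/osd/ceph-$id layout:

    # stop the OSD before touching its object store
    systemctl stop ceph-osd@17
    # keep a copy of the PG before removing anything
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
        --pgid 5.25 --op export --file /root/pg-5.25.export
    # remove the PG so the OSD can start without trying to split it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
        --pgid 5.25 --op remove --force
    systemctl start ceph-osd@17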
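The mempool numbers come from the admin socket on the OSD host; buffer_anon is the figure that stands out, far above every other pool:

    # run on the host where osd.17 lives (uses the admin socket under /var/run/ceph)
    ceph daemon osd.17 dump_mempools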
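The 20/20 logging and the autoscaler's pg_num change were set up / confirmed roughly as follows (again from memory; the pool name is a placeholder):

    # raise OSD debug logging for the startup attempt
    ceph config set osd.17 debug_osd 20/20
    # show current and target pg_num per pool, including the RBD pool
    ceph osd pool autoscale-status
    ceph osd pool get <rbd-pool> pg_num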