After an upgrade from 15.2.4 to 15.2.13, this small home lab cluster ran into issues with OSDs failing on all four hosts. This might be unrelated to the upgrade, but the trigger appears to have been an autoscaling event in which the RBD pool was scaled from 128 PGs to 512 PGs. Only some OSDs are affected, and during OSD startup the following output can be observed:

2021-07-08T03:57:55.496+0200 7fc7303ff700 10 osd.17 146136 split_pgs splitting pg[5.25( v 146017'38948152 (146011'38947652,146017'38948152] local-lis/les=146012/146013 n=1168 ec=2338/46 lis/c=146012/145792 les/c/f=146013/145793/36878 sis=146019) [17,6] r=0 lpr=146019 pi=[145792,146019)/1 crt=146017'38948152 lcod 0'0 mlcod 0'0 unknown mbc={}] into 5.a5

Exporting/removing the PGs belonging to the pool in question seems to resolve the OOM kills, but (naturally) at the cost of data loss. There isn't a lot of activity in the log (20/20 logging), but everything seems to revolve around splitting PGs. Full OSD startup log attached.

At this point I've exported all the troublesome PGs and gotten all the OSDs online (the commands I used are sketched at the end of this mail). Attempting to start one of the troubled OSDs with a troubled PG still present results in all memory (80 GiB) being exhausted before the OOM killer steps in.

Looking at a dump of the mempools, buffer_anon looks severely high. Memory leak?

Any guidance on how to further troubleshoot this issue would be greatly appreciated.
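P.S. For reference, this is roughly how I exported and removed an affected PG, written down from memory, so treat it as a sketch rather than an exact transcript. pg 5.25 and osd.17 are just the example from the log above, and the data path assumes the default non-containerized /var/lib/ceph/osd/ceph-$id layout:

    # stop the OSD before touching its object store
    systemctl stop ceph-osd@17
    # keep a copy of the PG before removing anything
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
        --pgid 5.25 --op export --file /root/pg-5.25.export
    # remove the PG so the OSD can start without trying to split it
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-17 \
        --pgid 5.25 --op remove --force
    systemctl start ceph-osd@17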
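The mempool numbers come from the admin socket on the OSD host; buffer_anon is the figure that stands out, far above every other pool:

    # run on the host where osd.17 lives (uses the admin socket under /var/run/ceph)
    ceph daemon osd.17 dump_mempools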
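The 20/20 logging and the autoscaler's pg_num change were set up / confirmed roughly as follows (again from memory; the pool name is a placeholder):

    # raise OSD debug logging for the startup attempt
    ceph config set osd.17 debug_osd 20/20
    # show current and target pg_num per pool, including the RBD pool
    ceph osd pool autoscale-status
    ceph osd pool get <rbd-pool> pg_num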