Hi all,

with the help of Croit we got back on our feet. I will post a detailed post-mortem later this month, including information about how to check whether a cluster is in the same situation.

Long story short, we hit a deadlock due to competition between MDS cache trimming and purging stale strays. "Disabling" cache trimming by setting a ridiculously high mds_cache_memory_limit on the bad rank did the trick. Purging 100 million strays is actually no problem and requires little if any RAM by itself (I mean the purge that happens on MDS restart; I don't know whether the forward-scrub purge behaves the same). Our cluster managed to purge about 10K items/s, and after a few hours everything was cleaned out. The MDS was serving client IO while purging, so the FS is up right away.

A big thank you to everyone who helped with this case.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Eugen Block <eblock@xxxxxx>
Sent: Monday, January 20, 2025 6:49 PM
To: Frank Schilder
Cc: ceph-users@xxxxxxx
Subject: Re: Re: MDS hung in purge_stale_snap_data after populating cache

A colleague of mine suggested creating a core dump when the MDS has become stale and then inspecting it with gdb. But if you think it's more promising to increase the buffer, or maybe it's quicker to test, then do that first.

Quoting Frank Schilder <frans@xxxxxx>:

>> which is 3758096384. I'm not even sure what the unit is, probably bytes?
>
> Sorry, it is bytes. Our items are about 100b on average, that's how
> we observe approximately 37462448 executions of
> purge_stale_snap_data until the queue is filled up.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <frans@xxxxxx>
> Sent: Monday, January 20, 2025 1:51 PM
> To: Eugen Block
> Cc: ceph-users@xxxxxxx
> Subject: Re: MDS hung in purge_stale_snap_data after
> populating cache
>
>> which is 3758096384. I'm not even sure what the unit is, probably bytes?
>
> As far as I understand the unit is "list items". They can have
> variable length. On our system about 400G are allocated while
> filling up the bufferlist.
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
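
For readers who want to check the numbers in this thread, here is a rough back-of-the-envelope sketch. It only uses figures quoted in the messages (a 3758096384-byte limit, which is 3.5 GiB; 37462448 observed executions of purge_stale_snap_data; roughly 100 million strays; roughly 10K purged items per second). The constant and function names are made up for illustration and are not Ceph identifiers, and the stray count and purge rate are approximate values from the report, not measurements.

```python
# Figures quoted in the thread; all names here are illustrative, not Ceph options.
BUFFER_LIMIT_BYTES = 3_758_096_384   # limit discussed in the thread (3.5 GiB)
OBSERVED_EXECUTIONS = 37_462_448     # purge_stale_snap_data calls until the queue filled up
STRAY_COUNT = 100_000_000            # "100Mio strays" (approximate)
PURGE_RATE_PER_S = 10_000            # "about 10K items/s" (approximate)


def average_item_size(limit_bytes: int, executions: int) -> float:
    """Average bytes per queued item implied by the limit and the call count."""
    return limit_bytes / executions


def purge_duration_hours(items: int, rate_per_s: float) -> float:
    """Time to purge `items` at a steady `rate_per_s`, in hours."""
    return items / rate_per_s / 3600


if __name__ == "__main__":
    size = average_item_size(BUFFER_LIMIT_BYTES, OBSERVED_EXECUTIONS)
    hours = purge_duration_hours(STRAY_COUNT, PURGE_RATE_PER_S)
    print(f"implied average item size: {size:.1f} bytes")
    print(f"estimated purge time: {hours:.1f} hours")
```

Both results line up with what was reported: an implied item size close to the "about 100b" average, and a purge time consistent with "a few hours".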