Dear all,

a quick update and some answers. We set up a dedicated host for running an MDS and debugging the problem. On this host we have 750G RAM, 4T swap and 4T log, both on fast SSDs. The plan is to monitor the MDS with "perf top" while it becomes the designated MDS for the problematic rank, and also to pull a detailed log of the startup up to the point where the MDS hangs.

I have some questions about that, a new observation that might be relevant and some answers to some suggestions, in that order.

I need to install debug info for perf to give useful output. I can't find a meta package that has all ceph-debuginfo rpms as a dependency. Which ones should I install? Also, should I install some kernel debug info packages? Please note that every restart takes about 1h to hit the issue, so I would like to have as much as possible installed the first time.

A new observation: After every restart cycle the rank loads a little bit less into cache. However, the num_stray count does not decrease. Could that mean the problem is not the high num_stray count but something else?

Answer to a suggestion: It is not possible to access anything on the MDS or the entire file system. Hence, trying to stat some files/dirs is not possible. Furthermore, not all stray items can be reintegrated with this method and I'm afraid our stray items are mostly of this nature. This means that in Octopus the only way to trim (evaluate) stray items was an MDS restart. For details, the relevant discussions are https://www.spinics.net/lists/ceph-users/msg70459.html and https://www.spinics.net/lists/ceph-users/msg73150.html with the most important info in this message: https://www.spinics.net/lists/ceph-users/msg70849.html .

Summary: As part of the debugging back then I executed a recursive stat-ing of files and directories on the *entire* file system, only to observe that the stray count didn't change.
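For readers unfamiliar with the method, a recursive stat pass of the kind mentioned above can be sketched roughly as follows. This is only an illustrative sketch, not the command that was actually run back then; the function name stat_tree is hypothetical, and it simply issues an lstat on every directory and file below a given root and counts successes and failures.

```python
import os


def stat_tree(root):
    """Recursively lstat every entry at and below `root`.

    Returns a tuple (visited, failed): the number of entries that were
    stat-ed successfully and the number that raised an OSError.
    The name `stat_tree` is hypothetical and used for illustration only.
    """
    visited = 0
    failed = 0

    # Stat the root itself exactly once.
    try:
        os.lstat(root)
        visited += 1
    except OSError:
        failed += 1

    # os.walk does not follow symlinked directories by default.
    for dirpath, dirnames, filenames in os.walk(root):
        # Each child is stat-ed here, so no entry is counted twice.
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
                visited += 1
            except OSError:
                failed += 1

    return visited, failed
```

It would be called with the mount point of the file system as root, for example stat_tree("/mnt/cephfs") (the path is an assumption). On a large file system such a pass takes a long time and, as described here, does not necessarily change the stray count.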
This was when Gregory finally explained that hard links can block stray removal on snaptrim, also for paths that are no longer accessible through the file system or any snapshots; that is, the usual stray evaluation doesn't have any effect. That's the situation we are in: we need the MDS to do it itself.

A correction: It was actually Venky Shankar participating in this communication and not Neha. Is Venky still working on the FS?

Thanks for package hints and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14