Hi all,

We're currently troubleshooting our Ceph cluster. Every time the active manager switches or restarts, the whole cluster becomes slow or unresponsive for a short period. Each time this happens we also see a lot of leader elections in the monitors, and "ceph status" reports monitors as down.

We tried to troubleshoot the issue, and our current suspicion is that the root cause is somewhere in the monitors. Each time the issue occurs, the ms_dispatch and msgr-worker-* threads are at 100% CPU usage. It looks like this:
https://drive.google.com/file/d/16XmaTM4ILhYSg76IzZqIkqZP2V8sYl_g/view?usp=sharing

The leader elections are probably a side effect of that high CPU usage.

CPU usage graph of the monitor docker container:
https://drive.google.com/file/d/13iWv6i4VIo1E5FhYWwqZPSSuYpi3Epy_/view?usp=sharing

Monitor log with debug_mon = 20:
https://drive.google.com/file/d/1Q6xoa1PJwrDq8oYYui-59KBYWV8g3QQu/view?usp=sharing

Here is what we have already tried on the monitors:

1) Set ms_async_op_threads=10, ms_async_max_op_threads=16 -> this reduced the CPU usage of the msgr-worker-* threads, but ms_dispatch is unfortunately still at 100%
2) Set ms_async_send_inline=true -> slight improvement, but ms_dispatch still hits 100%
3) Set ms_nocrc=true -> no improvement
4) Moved the monitor to a node with higher single-core CPU performance -> no improvement

To be honest, we don't know whether the issue is even related to the ms_dispatch and msgr-worker-* threads, but they are the only thing we found that is reaching some sort of limit.

We use Ceph 15.2.15 with the cephadm backend.

Thanks,
~ Roman
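
P.S. In case it helps anyone reproduce the per-thread CPU numbers mentioned above, here is a minimal Python sketch. It assumes a Linux host, that you already know the ceph-mon PID as seen from the host (not from inside the container), and it reads the standard /proc/<pid>/task/<tid>/stat files. The function names are only illustrative and are not part of Ceph.

```python
import os
import time

def parse_thread_stat(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name sits in parentheses and may itself contain spaces,
    # so split on the first '(' and the last ')'.
    start = stat_line.index("(")
    end = stat_line.rindex(")")
    name = stat_line[start + 1:end]
    rest = stat_line[end + 1:].split()
    # rest[0] is field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(rest[11]), int(rest[12])
    return name, utime + stime

def sample_threads(pid):
    """Map thread id -> (name, cpu_ticks) for every thread of the given PID."""
    result = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                result[int(tid)] = parse_thread_stat(f.read())
        except (FileNotFoundError, ProcessLookupError):
            continue  # thread exited between listing and reading
    return result

def cpu_percent(before, after, interval, ticks_per_sec):
    """Per-thread CPU usage, as a percentage of one core, between two samples."""
    usage = {}
    for tid, (name, ticks_after) in after.items():
        if tid in before:
            delta = ticks_after - before[tid][1]
            usage[tid] = (name, 100.0 * delta / (ticks_per_sec * interval))
    return usage

def report(pid, interval=1.0):
    """Print threads of the given PID sorted by CPU usage over one interval."""
    ticks = os.sysconf("SC_CLK_TCK")
    before = sample_threads(pid)
    time.sleep(interval)
    after = sample_threads(pid)
    rows = cpu_percent(before, after, interval, ticks)
    for tid, (name, pct) in sorted(rows.items(), key=lambda r: -r[1][1]):
        print(f"{tid:>8} {name:<16} {pct:6.1f}%")
```

Calling report(<ceph-mon PID>) while the manager fails over should list ms_dispatch and the msgr-worker-* threads at the top if they are the ones pegging a core. A thread shown near 100% is saturating a single core, which would fit with more messenger worker threads not helping ms_dispatch.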