Ceph unresponsive on manager restart

Roman Steinhart <roman@xxxxxxxxxxx> · Wed, 1 Dec 2021 13:43:17 +0100

Hi all,

We're currently troubleshooting our Ceph cluster.
It appears that every time the active manager switches or restarts, the
whole cluster becomes slow or unresponsive for a short period of time.
Every time that happens, we also see a lot of leader elections among the
monitors, and "ceph status" reports monitors as down.
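
To put a number on "a lot of leader elections", something like the
following sketch could count election messages per minute in a log file.
It assumes that election lines contain the text "calling monitor election"
and that each line starts with an ISO 8601 timestamp whose first 16
characters identify the minute; both are assumptions about the log format,
and the function name is only illustrative.

```python
from collections import Counter


def elections_per_minute(lines, marker="calling monitor election"):
    """Count log lines containing `marker`, grouped by minute.

    Assumption: every line starts with a timestamp such as
    2021-12-01T13:43:17.123+0100, so line[:16] is "2021-12-01T13:43".
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            counts[line[:16]] += 1
    return counts


def report(path):
    """Print one line per minute in which at least one election was logged."""
    with open(path, errors="replace") as handle:
        counts = elections_per_minute(handle)
    for minute in sorted(counts):
        print(minute, counts[minute])
```

Running report() on the cluster log or a monitor log and comparing the
busy minutes with the times of the manager restarts would show how closely
the two line up.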

We tried to troubleshoot the issue, and we currently suspect that the root
cause is somewhere in the monitors.
We discovered that each time the issue occurs, the ms_dispatch
and msgr-worker-* threads are at 100% CPU usage.
It looks like this:
https://drive.google.com/file/d/16XmaTM4ILhYSg76IzZqIkqZP2V8sYl_g/view?usp=sharing
The leader elections are probably a side effect of that high CPU usage.
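
For anyone who wants to reproduce the per-thread numbers without a
screenshot, here is a minimal sketch that samples per-thread CPU usage from
/proc on Linux. It assumes you know the host PID of the ceph-mon process
(with cephadm it runs inside a container, so the PID has to be looked up on
the host); the function names are only illustrative.

```python
import os
import time


def parse_thread_stat(stat_line):
    """Return (thread_name, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The name sits between the first "(" and the last ")" and may contain spaces.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    name = stat_line[open_paren + 1:close_paren]
    # The fields after the name start at field 3 (state);
    # utime is field 14 and stime is field 15, both in clock ticks.
    rest = stat_line[close_paren + 1:].split()
    utime, stime = int(rest[11]), int(rest[12])
    return name, utime + stime


def read_thread_ticks(pid):
    """Map thread id -> (name, cpu_ticks) for every thread of the process."""
    ticks = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as handle:
                ticks[tid] = parse_thread_stat(handle.read())
        except OSError:
            continue  # the thread exited between listdir() and open()
    return ticks


def cpu_percent(before, after, interval, ticks_per_second):
    """Per-thread CPU usage as a percentage of one core over the interval."""
    usage = {}
    for tid, (name, end_ticks) in after.items():
        if tid in before:
            used = end_ticks - before[tid][1]
            usage[(tid, name)] = 100.0 * used / (ticks_per_second * interval)
    return usage


def sample(pid, interval=5.0):
    """Sample the process twice and return per-thread usage, busiest first."""
    before = read_thread_ticks(pid)
    time.sleep(interval)
    after = read_thread_ticks(pid)
    usage = cpu_percent(before, after, interval, os.sysconf("SC_CLK_TCK"))
    return sorted(usage.items(), key=lambda item: item[1], reverse=True)
```

Calling sample(pid) during a manager restart should list ms_dispatch and
the msgr-worker-* threads at the top if they are the ones saturating a
core.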

CPU usage graph of the monitor docker container:
https://drive.google.com/file/d/13iWv6i4VIo1E5FhYWwqZPSSuYpi3Epy_/view?usp=sharing
Monitor log with debug_mon = 20:
https://drive.google.com/file/d/1Q6xoa1PJwrDq8oYYui-59KBYWV8g3QQu/view?usp=sharing

We already tried to modify the configuration of the monitors:
1) Set ms_async_op_threads=10, ms_async_max_op_threads=16 -> this reduced
the CPU usage of the msgr-worker-* threads, but ms_dispatch unfortunately
still hits 100%
2) Set ms_async_send_inline=true -> slight improvement, but ms_dispatch
still hits 100%
3) Set ms_nocrc=true -> no improvement
4) Moved the monitor to a node with higher single-core CPU performance ->
no improvement
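
For reference, the settings from steps 1 to 3 could be expressed as
"ceph config set" commands roughly as in the sketch below. This is only
one possible way to apply them, not necessarily how it was done here: the
"mon" target, the helper names, and applying all options in one go are
assumptions, and it is not verified that Ceph 15.2.15 still accepts every
one of these option names or picks them up without a daemon restart.

```python
import subprocess

# Values copied from the list above. Assumption: the option names are
# passed to the cluster unchanged; whether this release honors each of
# them has not been checked.
TRIED_SETTINGS = {
    "ms_async_op_threads": "10",
    "ms_async_max_op_threads": "16",
    "ms_async_send_inline": "true",
    "ms_nocrc": "true",
}


def build_commands(settings, who="mon"):
    """Return one 'ceph config set <who> <option> <value>' argv per setting."""
    return [["ceph", "config", "set", who, name, value]
            for name, value in settings.items()]


def apply_settings(settings, who="mon", dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for command in build_commands(settings, who):
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)
```

With dry_run left at True, apply_settings(TRIED_SETTINGS) only prints the
commands, so they can be reviewed before anything touches the cluster.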

To be honest, we don't know whether the issue is related to the
ms_dispatch and msgr-worker-* threads at all. But they are the only thing
we found that is reaching some sort of limit.

We use Ceph 15.2.15 with the cephadm backend.

Thanks,

~ Roman
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx