After we upgraded from Jewel (10.2.10) to Luminous (12.2.5), we started seeing a problem where the new ceph-mgr would sometimes hang indefinitely when running commands like 'ceph pg dump' on our largest cluster (~1,300 OSDs). The rest of our clusters (10+) aren't seeing the same issue, but they are all under 600 OSDs each. Restarting ceph-mgr seems to fix the issue for 12 hours or so, but usually overnight it gets back into the state where the hang reappears.
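
To pin down when the hang starts after a restart, the command can be probed on a schedule with a hard timeout. The following is only a minimal sketch, assuming Python 3 and the ceph CLI on an admin node; the probe() helper and the 60-second timeout are illustrative choices, not values taken from the cluster.

```python
import subprocess
import time


def probe(cmd, timeout_s):
    """Run cmd and return (hung, elapsed_seconds).

    hung is True when the command did not finish within timeout_s;
    subprocess.run() kills the child process in that case.
    """
    start = time.monotonic()
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        hung = False
    except subprocess.TimeoutExpired:
        hung = True
    return hung, time.monotonic() - start


if __name__ == "__main__":
    # Assumption: 60 s is a generous cutoff for a pg dump on this cluster.
    hung, elapsed = probe(
        ["ceph", "pg", "dump", "--format", "json-pretty"], timeout_s=60
    )
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    state = "HUNG" if hung else "ok"
    print(f"{stamp} pg dump {state} after {elapsed:.1f}s")
```

Run from cron every few minutes, the first HUNG line gives a timestamp to line up against the ceph-mgr log.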

At first I thought it was a hardware issue, but switching the primary ceph-mgr to another node didn't fix the problem. I've increased the logging to 20/20 for debug_mgr, and while a working dump looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700 4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700 4 mgr.server handle_command prefix=pg dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700 0 log_channel(audit) log [DBG] : from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]: dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700 1 mgr.server reply handle_command (0) Success dumped all

A failed dump call doesn't show up at all. The "mgr.server handle_command prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com