ceph-mgr hangs on larger clusters in Luminous

Bryan Stillwell <bstillwell@xxxxxxxxxxx> · Thu, 18 Oct 2018 17:16:47 +0000

After we upgraded from Jewel (10.2.10) to Luminous (12.2.5) we started seeing a problem where the new ceph-mgr would sometimes hang indefinitely when doing commands like 'ceph pg dump' on our largest cluster
 (~1,300 OSDs).  The rest of our clusters (10+) aren't seeing the same issue, but they are all under 600 OSDs each.  Restarting ceph-mgr seems to fix the issue for 12 hours or so, but usually overnight it'll get back into the state where the hang reappears. 
 At first I thought it was a hardware issue, but switching the primary ceph-mgr to another node didn't fix the problem.

I've increased the logging to 20/20 for debug_mgr, and while a working dump looks like this:

2018-10-18 09:26:16.256911 7f9dbf5e7700  4 mgr.server handle_command decoded 3
2018-10-18 09:26:16.256917 7f9dbf5e7700  4 mgr.server handle_command prefix=pg dump
2018-10-18 09:26:16.256937 7f9dbf5e7700 10 mgr.server _allowed_command  client.admin capable
2018-10-18 09:26:16.256951 7f9dbf5e7700  0 log_channel(audit) log [DBG] : from='client.1414554763 10.2.4.2:0/2175076978' entity='client.admin' cmd=[{"prefix": "pg dump", "target": ["mgr", ""], "format": "json-pretty"}]:
 dispatch
2018-10-18 09:26:22.567583 7f9dbf5e7700  1 mgr.server reply handle_command (0) Success dumped all

A failed dump call doesn't show up at all.  The "mgr.server handle_command prefix=pg dump" log entry doesn't seem to even make it to the logs.

This problem also continued to appear after upgrading to 12.2.8.

Has anyone else seen this?

Thanks,
Bryan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com