John,

Aha, thanks for that -- that got me closer to the problem.

I forgot an important detail: A few days before the upgrade, I set the cluster and public networks in the config files on the nodes to the "back-end" network, which the MON nodes don't have access to. I suspected that this was a bad idea at the time, but since it didn't break anything (we are still in test mode on this cluster, so downtime is completely fine), I figured it somehow didn’t matter. I must have forgotten to restart the ceph service on the MONs so the symptom didn't appear until the ceph upgrade.

I just switched the public network back to the "front-end network", which the MONs do have access to, and now the ceph_rest_api runs fine (and your "tell osd.0 version" does as well). So that problem's solved.

But now we're back to the original problem, which is why I was monkeying with the "public network" config entry to begin with. Let me explain:

As I said, we have two separate networks:

10.197.5.0/24 - The "front-end" network, "skinny pipe", all 1Gbe, intended to be a management or control plane network
10.174.1.0/24 - The "back-end" network, "fat pipe", all OSD nodes use 2x bonded 10Gbe, intended to be a data network

So we want all of the OSD traffic to go over the "back end", and the MON traffic to go over the "front end". We thought the following would do that:

public network = 10.197.5.0/24   # skinny pipe, mgmt & MON traffic
cluster network = 10.174.1.0/24  # fat pipe, OSD traffic

But that doesn't seem to be the case -- iftop and netstat show that little/no OSD communication is happening over the 10.174.1 network and it's all happening over the 10.197.5 network.

What configuration should we be running to enforce the networks per our design? Thanks!

Jon Heese
Systems Engineer
INetU Managed Hosting
P: 610.266.7441 x 261
F: 610.266.7434
www.inetu.net

** This message contains confidential information, which also may be privileged, and is intended only for the person(s) addressed above.
Any unauthorized use, distribution, copying or disclosure of confidential and/or privileged information is strictly prohibited. If you have received this communication in error, please erase all copies of the message and its attachments and notify the sender immediately via reply e-mail. **

-----Original Message-----
From: John Spray [mailto:jspray@xxxxxxxxxx]
Sent: Thursday, October 22, 2015 12:48 PM
To: Jon Heese <jheese@xxxxxxxxx>
Cc: ceph-users@xxxxxxxxxxxxxx
Subject: Re: Problems with ceph_rest_api after update

On Thu, Oct 22, 2015 at 3:36 PM, Jon Heese <jheese@xxxxxxxxx> wrote:
> Hello,
>
> We are running a Ceph cluster with 3x CentOS 7 MON nodes, and after we
> updated the ceph packages on the MONs yesterday (from 0.94.3 to
> 0.94.4), the ceph_rest_api started refusing to run, giving the
> following error 30 seconds after it’s started:

Weird. Does this work?

"ceph --id admin tell osd.0 version"

get_command_descriptions is ceph_rest_api's way of asking an OSD to tell it what operations are supported. It's sent from ceph_rest_api to an OSD the same way a 'tell' command is sent from the CLI (although you can't actually issue get_command_descriptions with the CLI).

ceph_rest_api is picking the last up OSD it can see, as an arbitrary place to send the query, so if you have for example an up OSD that isn't really responsive, it could cause a problem.

John

> [root@ceph-mon01 ~]# /usr/bin/ceph-rest-api -c /etc/ceph/ceph.conf --cluster ceph -i admin
> Traceback (most recent call last):
>   File "/usr/bin/ceph-rest-api", line 59, in <module>
>     rest,
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 503, in generate_app
>     addr, port = api_setup(app, conf, cluster, clientname, clientid, args)
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 145, in api_setup
>     target=('osd', int(osdid)))
>   File "/usr/lib/python2.7/site-packages/ceph_rest_api.py", line 83, in get_command_descriptions
>     raise EnvironmentError(ret, err)
> EnvironmentError: [Errno -4] Can't get command descriptions:
>
> Nothing else was changed, only the packages were updated. I’ve looked
> at the python, and it seems to be timing out waiting for this line to
> complete, but I’m not sure where to look next in terms of what
> “get_command_descriptions” actually does:
>
>     ret, outbuf, outs = json_command(cluster, target,
>                                      prefix='get_command_descriptions',
>                                      timeout=30)
>
> Is this a known issue? If not, does anyone have any suggestions on
> how to troubleshoot this further? Thanks in advance.
>
> Jon Heese
> Systems Engineer
> INetU Managed Hosting
> P: 610.266.7441 x 261
> F: 610.266.7434
> www.inetu.net

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
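
A small sketch for checking the symptom described at the top of this thread (OSD connections showing up on 10.197.5.0/24 rather than 10.174.1.0/24). It only buckets peer IP addresses by subnet with Python's standard ipaddress module. The two subnets come from the message; the function names classify and tally and the sample addresses are hypothetical, and the peer addresses would have to be collected separately, for example from netstat output on an OSD node.

```python
import ipaddress
from collections import Counter

# Subnets as given in the message above.
PUBLIC_NET = ipaddress.ip_network("10.197.5.0/24")   # "front-end", skinny pipe
CLUSTER_NET = ipaddress.ip_network("10.174.1.0/24")  # "back-end", fat pipe


def classify(addr):
    """Return 'public', 'cluster', or 'other' for one IPv4 address string."""
    ip = ipaddress.ip_address(addr)
    if ip in CLUSTER_NET:
        return "cluster"
    if ip in PUBLIC_NET:
        return "public"
    return "other"


def tally(peer_addrs):
    """Count how many peer addresses fall into each network."""
    return dict(Counter(classify(a) for a in peer_addrs))


if __name__ == "__main__":
    # Hypothetical peer addresses, for illustration only.
    sample = ["10.197.5.21", "10.197.5.22", "10.174.1.21", "192.168.0.1"]
    print(tally(sample))
```

Feeding it the remote addresses of the ceph-osd sockets gives a quick count of how many connections sit on each network, which is easier to compare across nodes than reading iftop by eye.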