Hi everyone,

The non-core daemon registrations in the servicemap vs cephadm came up twice in the last couple of weeks:

First, https://github.com/ceph/ceph/pull/40035 changed rgw to register as rgw.$id.$gid and made cephadm complain about stray unmanaged daemons. The motivation was that the PR allows multiple radosgw daemons to share the same auth name + key and still show up in the servicemap.

Then, today, I noticed that cephfs-mirror caused the same cephadm error because it was registering as cephfs-mirror.$gid instead of the cephfs-mirror.$id that cephadm expected. I went to fix that in cephfs-mirror, but noticed that the behavior was copied from rbd-mirror... which wasn't causing any cephadm error. It turns out that cephadm has some special code for rbd-mirror to identify daemons in the servicemap:

https://github.com/ceph/ceph/blob/master/src/pybind/mgr/cephadm/serve.py#L412-L420

So to fix cephfs-mirror, I opted to keep the existing behavior and adjust cephadm:

https://github.com/ceph/ceph/pull/40220/commits/30d87f3746ff9daf219366354f24c0d8e306844a

For now, at least, that solves the problem. But, as things stand, rgw and {cephfs,rbd}-mirror are behaving a bit differently with the servicemap. The registrations look like so:

{
    "epoch": 538,
    "modified": "2021-03-18T17:28:12.500356-0400",
    "services": {
        "cephfs-mirror": {
            "daemons": {
                "summary": "",
                "4220": {
                    "start_epoch": 501,
                    "start_stamp": "2021-03-18T12:49:32.929888-0400",
                    "gid": 4220,
                    "addr": "10.3.64.25:0/3521332238",
                    "metadata": {
                        ...
                        "id": "dael.csfspq",
                        "instance_id": "4220",
                        ...
                    },
                    "task_status": {}
                }
            }
        },
        "rbd-mirror": {
            "daemons": {
                "summary": "",
                "4272": {
                    "start_epoch": 531,
                    "start_stamp": "2021-03-18T16:31:26.540108-0400",
                    "gid": 4272,
                    "addr": "10.3.64.25:0/2576541551",
                    "metadata": {
                        ...
                        "id": "dael.kfenmm",
                        "instance_id": "4272",
                        ...
                    },
                    "task_status": {}
                },
                "4299": {
                    "start_epoch": 534,
                    "start_stamp": "2021-03-18T16:52:59.027580-0400",
                    "gid": 4299,
                    "addr": "10.3.64.25:0/600966616",
                    "metadata": {
                        ...
                        "id": "dael.yfhmmq",
                        "instance_id": "4299",
                        ...
                    },
                    "task_status": {}
                }
            }
        },
        "rgw": {
            "daemons": {
                "summary": "",
                "foo.dael.hwyogi": {
                    "start_epoch": 537,
                    "start_stamp": "2021-03-18T17:27:58.998535-0400",
                    "gid": 4319,
                    "addr": "10.3.64.25:0/3084463187",
                    "metadata": {
                        ...
                        "zone_id": "6321d54d-d780-43f3-af53-ce52aed2ef8a",
                        "zone_name": "default",
                        "zonegroup_id": "e8453745-84a7-4d58-9aa9-9bfaf1ce9a7f",
                        "zonegroup_name": "default"
                    },
                    "task_status": {}
                },
                "foo.dael.pyvurh": {
                    "start_epoch": 537,
                    "start_stamp": "2021-03-18T17:27:58.999620-0400",
                    "gid": 4318,
                    "addr": "10.3.64.25:0/2303221705",
                    "metadata": {
                        ...
                        "zone_id": "6321d54d-d780-43f3-af53-ce52aed2ef8a",
                        "zone_name": "default",
                        "zonegroup_id": "e8453745-84a7-4d58-9aa9-9bfaf1ce9a7f",
                        "zonegroup_name": "default"
                    },
                    "task_status": {}
                },
                "foo.dael.rqipjp": {
                    "start_epoch": 538,
                    "start_stamp": "2021-03-18T17:28:10.866327-0400",
                    "gid": 4330,
                    "addr": "10.3.64.25:0/4039152887",
                    "metadata": {
                        ...
                        "zone_id": "6321d54d-d780-43f3-af53-ce52aed2ef8a",
                        "zone_name": "default",
                        "zonegroup_id": "e8453745-84a7-4d58-9aa9-9bfaf1ce9a7f",
                        "zonegroup_name": "default"
                    },
                    "task_status": {}
                }
            }
        }
    }
}

With the *-mirror approach, the servicemap "key" is always the gid, and you have to look at the "id" in the metadata to see how the daemon is named/authenticated. With rgw, the name is the key and there is no "id" key. I'm inclined to just go with the gid-as-key for rgw too and add the "id" key so that we are behaving consistently.
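To make the difference concrete, here's a rough Python sketch of what a consumer of 'ceph service dump' has to do today to recover a daemon's name, versus what it could do if every service registered gid-as-key and supplied an "id" in its metadata. This is not the actual cephadm code in serve.py; the function names and the trimmed service dump are made up for illustration:

    # Rough sketch only; not cephadm's real matching logic.

    def daemon_name_today(service, key, entry):
        # Current conventions: *-mirror daemons use their gid as the
        # servicemap key, so the daemon name has to come from
        # metadata['id']; rgw uses the daemon name itself as the key.
        if service in ('rbd-mirror', 'cephfs-mirror'):
            return entry.get('metadata', {}).get('id', key)
        return key

    def daemon_name_unified(service, key, entry):
        # Proposed convention: gid-as-key everywhere, name always in
        # metadata['id'], so one rule covers every service.
        return entry.get('metadata', {}).get('id', key)

    # Trimmed-down example mirroring the dump above.
    service_map = {
        'cephfs-mirror': {'daemons': {
            'summary': '',
            '4220': {'gid': 4220, 'metadata': {'id': 'dael.csfspq'}},
        }},
        'rgw': {'daemons': {
            'summary': '',
            'foo.dael.rqipjp': {'gid': 4330, 'metadata': {}},
        }},
    }

    for service, info in service_map.items():
        for key, entry in info['daemons'].items():
            if key == 'summary':
                continue
            print(service, daemon_name_today(service, key, entry))

With the unified convention, the rbd-mirror special case in serve.py could presumably go away entirely.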
This would have the side effect of also accomplishing the original goal of allowing many rgw daemons to share the same auth identity and still show up in the servicemap.

The downside is that interpreting the servicemap for the running daemons is a bit more work. For example, currently ceph -s shows

  services:
    mon: 1 daemons, quorum a (age 2d)
    mgr: x(active, since 58m)
    osd: 1 osds: 1 up (since 2d), 1 in (since 2d)
    cephfs-mirror: 1 daemon active (4220)
    rbd-mirror: 2 daemons active (4272, 4299)
    rgw: 2 daemons active (foo.dael.rqipjp, foo.dael.sajkvh)

Showing the gids there is clearly not what we want. But similarly, showing the daemon names is probably also a bad idea since it won't scale beyond ~3 or so; we probably just want a simple count.

Reasonable?

sage