Here's what I'm currently thinking.

First we should get a few things out of the way, like

- removing deepsea and ansible
  https://github.com/ceph/ceph/pull/33126
- ceph orchestrator ... -> ceph orch ...
  https://github.com/ceph/ceph/pull/33131
- your rename PR, if the underlying bug is resolved

Next I think we need to fix the shape of the CLI to resolve the service group vs service/daemon ambiguity. Here's my proposal:

  https://pad.ceph.com/p/orchestrator_cli

Then I think we can proceed more quickly in parallel with adding the additional services (monitoring, nfs, etc.) and improving the cephadm internals.

On the internals side, the core problem I see is _get_services(), which has basically two users:

- callers in cephadm that need a recent view in order to make decisions about scheduling and placement
- serve() and 'service ls --refresh', which need to trigger an actual scrape of the remote hosts.

The serve() one is the most important, IMO: we need it to (1) be parallel, (2) gracefully handle errors for each host and raise appropriate health alerts, and (3) update the cache as appropriate. For the CLI case, whether it triggers the scrape synchronously or somehow kicks serve() and waits is a probably-not-so-important detail.

On the other hand, I think the remaining internal _get_services() callers should all just use the latest cached state. Right now the way the code is structured makes it very confusing which path is used for which, and the use of the async_map_completion helper (currently, at least) makes it hard to tell which host failed.

As for additional services (monitoring, nfs, etc.), I think that can proceed more quickly once we have the CLI and add/remove/update issues sorted out.

I may start with an RFC PR on that, but I would really like some feedback on whether the proposal makes sense.

Thanks!
sage
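
To make the serve() requirements above a bit more concrete, here is a rough sketch of the split between a parallel scrape that updates a cache and internal callers that only read the cache. This is not the cephadm code: refresh_hosts(), get_cached_services(), fake_scrape(), and the plain-dict cache are all hypothetical names invented for illustration, and the returned failure dict only stands in for whatever would actually raise a health alert.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List


def refresh_hosts(
    hosts: Iterable[str],
    scrape: Callable[[str], List[str]],   # hypothetical per-host scrape function
    cache: Dict[str, List[str]],          # hypothetical cache: host -> daemon names
    max_workers: int = 8,
) -> Dict[str, str]:
    """Scrape all hosts in parallel; return {host: error} for hosts that failed."""
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # (1) parallel: one task per host, with each future mapped back to
        # its host so that a failure can be attributed to a specific host.
        futures = {pool.submit(scrape, host): host for host in hosts}
        for fut in as_completed(futures):
            host = futures[fut]
            try:
                # (3) update the cache only when the scrape succeeded
                cache[host] = fut.result()
            except Exception as e:
                # (2) one bad host does not abort the others; the stale cache
                # entry is kept and the error is recorded for a health alert.
                failed[host] = str(e)
    return failed


def get_cached_services(cache: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Internal callers read the latest cached state and never trigger a scrape."""
    return {host: list(daemons) for host, daemons in cache.items()}


def fake_scrape(host: str) -> List[str]:
    # Stand-in for the remote call; host2 simulates an unreachable host.
    if host == "host2":
        raise RuntimeError("ssh timeout")
    return [f"mon.{host}"]


cache = {"host2": ["osd.3"]}  # stale entry left over from an earlier refresh
failed = refresh_hosts(["host1", "host2", "host3"], fake_scrape, cache)
```

In this sketch, 'service ls --refresh' could call refresh_hosts() directly or ask serve() to run it and wait; either way the scheduling and placement callers would go through get_cached_services() only.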