The next thing I tried was taking a low-impact host (a Rados GW instance, in this case), purging all Ceph/Podman state from it, and re-installing it from scratch. But I'm now seeing this strange error when simply re-adding the host via "ceph orch host add", at the point where a disk inventory is taken:

2021-01-11 12:01:16,028 DEBUG Running command: /bin/podman run --rm --ipc=host --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.7 -e NODE_NAME=ceph02-rgw-01 -v /var/run/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c:/var/run/ceph:z -v /var/log/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c:/var/log/ceph:z -v /var/lib/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm docker.io/ceph/ceph:v15.2.7 inventory --format=json --filter-for-batch
2021-01-11 12:01:16,518 INFO /bin/podman:stderr usage: ceph-volume inventory [-h] [--format {plain,json,json-pretty}] [path]
2021-01-11 12:01:16,519 INFO /bin/podman:stderr ceph-volume inventory: error: unrecognized arguments: --filter-for-batch

Deployment of the RGW container to the host is similarly blocked again.

This seems to be covered by this ceph-volume issue, https://tracker.ceph.com/issues/48694, but it may not have been addressed yet... (A manual re-run of the same inventory command, minus the new flag, is sketched below the quoted message.)

On Mon, 11 Jan 2021 at 10:36, Paul Browne <pfb29@xxxxxxxxx> wrote:
> Hello all,
>
> I've been having some real trouble getting cephadm to apply some very
> minor point-release updates cleanly; twice now, applying the point updates
> 15.2.6 -> 15.2.7 and then 15.2.7 -> 15.2.8 has gotten blocked somewhere,
> ended up making no progress, and required digging deep into internals to
> unblock things.
>
> In the most recent attempt, 15.2.7 -> 15.2.8, the Orchestrator cleanly
> replaced the Mon and MGR containers in the first steps, but when it came
> to replacing the Crash daemon containers, the running 15.2.7 Crash
> container was purged and the container update then got blocked trying to
> start it again on the older image, leading to an infinite loop in the
> logs of Podman trying to start a non-existent container;
>
> https://pastebin.com/9zdMs1XU
>
> Forcing a `ceph orch daemon rm` of the affected Crash daemon on that host
> just repeats the loop.
>
> I then tried removing the Crash service and all of its daemons through
> the Orchestrator API.
>
> This purged all running Crash containers from all hosts; I then
> re-applied a service spec to restart them, hopefully on the new image.
>
> The Orchestrator removal of the Crash containers seems to have left
> container state dangling on the hosts, however, as we now see the same
> issue of Crash containers failing to start on *every* host in the cluster
> due to left-over container state;
>
> https://pastebin.com/tjaegxqg
>
> At this point I'm not certain whether Podman (v1.6.4 EPEL, CentOS 7.9) or
> the Orchestrator is to blame for leaving this state dangling and blocking
> new container creation, but it's proving a real problem in applying even
> simple minor-version point updates.
>
> Has anyone else been seeing similar behaviour when applying minor-version
> updates via cephadm+Orchestrator? Are there any good workarounds to clean
> up the dangling container state?
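
For anyone wanting to reproduce or sanity-check this by hand: below is the same podman/ceph-volume invocation that cephadm ran, taken from the log above (the fsid, hostname, and image tag are from my cluster, so adjust them for your own), with only the unsupported --filter-for-batch flag dropped. If the inventory then returns JSON cleanly, that would suggest the only problem is the newer flag being passed to an older ceph-volume in the image:

  # Check which arguments this image's ceph-volume inventory actually accepts
  /bin/podman run --rm --entrypoint /usr/sbin/ceph-volume \
      docker.io/ceph/ceph:v15.2.7 inventory --help

  # Re-run the exact inventory command from the cephadm log, minus --filter-for-batch
  /bin/podman run --rm --ipc=host --net=host \
      --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk \
      -e CONTAINER_IMAGE=docker.io/ceph/ceph:v15.2.7 -e NODE_NAME=ceph02-rgw-01 \
      -v /var/run/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c:/var/run/ceph:z \
      -v /var/log/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c:/var/log/ceph:z \
      -v /var/lib/ceph/fbbe7cac-3324-11eb-8186-34800d5b932c/crash:/var/lib/ceph/crash:z \
      -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys \
      -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm \
      docker.io/ceph/ceph:v15.2.7 inventory --format=json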
--
*******************
Paul Browne
Research Computing Platforms
University Information Services
Roger Needham Building
JJ Thompson Avenue
University of Cambridge
Cambridge
United Kingdom
E-Mail: pfb29@xxxxxxxxx
Tel: 0044-1223-746548
*******************