Yup, looks exactly like the problem described in that thread.
For the next cluster I’ll try without the _admin labels.
I think I added them because without the label the files in /etc/ceph every so often “magically” disappeared.
So I’ll save copies of them before doing the upgrade.

Thanks,
Uli

> On 27. 04 2022, at 12:17, Kuo Gene <genekuo@xxxxxxxxxxxxxx> wrote:
>
> Hi,
>
> There’s a previous discussion about this issue.
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/ZZZGEVD4L4SEMEBMTZYDW2OOVWQFXGOA/
>
> Can you check if there is any host with the _admin label on it? Removing the label works for me.
>
> Regards,
> Gene Kuo
>
>> On Apr 27, 2022, at 18:21, Ulrich Klein <Ulrich.Klein@xxxxxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> Yesterday I upgraded my smallest test system, 4 Raspberries 4B, from Pacific 16.2.7 (cephadm/containerized) to 17.2.0 using
>> ceph orch upgrade start --ceph-version 17.2.0
>>
>> It mostly worked ok, but wouldn't have finished without manual intervention.
>> Apparently each time a mgr is upgraded the process creates new /etc/ceph/ceph.conf and /etc/ceph/ceph.client.admin.keyring files on all nodes.
>> To do that it looks like it first copies the files to /tmp/etc/ceph/ceph.conf on the node, then changes owner and permissions and then tries to move the file into place. Unfortunately it changes owner/permissions in such a way that it no longer has permission to write to and move the file, resulting in something like this in an infinite (?) loop:
>>
>> 2022-04-27T09:03:45.032808+0000 mgr.ceph00.lpaijp (mgr.2314108) 605 : cephadm [ERR] executing refresh((['ceph00', 'ceph01', 'ceph02', 'ceph03'],)) failed.
>> Traceback (most recent call last):
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 221, in _write_remote_file
>>     await asyncssh.scp(f.name, (conn, tmp_path))
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
>>     await source.run(srcpath)
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
>>     self.handle_error(exc)
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
>>     raise exc from None
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
>>     await self._send_files(path, b'')
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
>>     self.handle_error(exc)
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
>>     raise exc from None
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
>>     await self._send_file(srcpath, dstpath, attrs)
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
>>     await self._make_cd_request(b'C', attrs, size, srcpath)
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
>>     self._fs.basename(path))
>>   File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
>>     raise exc
>> asyncssh.sftp.SFTPFailure: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied
>>
>> During handling of the above exception, another exception occurred:
>>
>> Traceback (most recent call last):
>>   File "/usr/share/ceph/mgr/cephadm/utils.py", line 76, in do_work
>>     return f(*arg)
>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 265, in refresh
>>     self._write_client_files(client_files, host)
>>   File "/usr/share/ceph/mgr/cephadm/serve.py", line 1052, in _write_client_files
>>     self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 238, in write_remote_file
>>     host, path, content, mode, uid, gid, addr))
>>   File "/usr/share/ceph/mgr/cephadm/module.py", line 569, in wait_async
>>     return self.event_loop.get_result(coro)
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
>>     return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
>>   File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
>>     return self.__get_result()
>>   File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
>>     raise self._exception
>>   File "/usr/share/ceph/mgr/cephadm/ssh.py", line 226, in _write_remote_file
>>     raise OrchestratorError(msg)
>> orchestrator._interface.OrchestratorError: Unable to write ceph02:/etc/ceph/ceph.conf: scp: /tmp/etc/ceph/ceph.conf.new: Permission denied
>>
>>
>> On each node I had to do
>> cd /usr/bin
>> mv chmod chmod_real ; ln -s true chmod
>> mv chown chown_real ; ln -s true chown
>>
>> And then whenever the file(s) appeared:
>> chmod_real 666 /tmp/etc/ceph/ceph.conf.new
>>
>> to make it get over that hurdle. And once finished, restore the chown/chmod binaries and permissions.
>> I wonder if anyone else has seen that on Intel/AMD machines? Looks like a pretty obvious problem with the process shooting itself in the permission foot, and on bigger clusters that process would be a time-consuming pain.
>>
>> Ciao, Uli
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
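
For anyone finding this thread later, here is a rough sketch of the two steps discussed above: saving copies of the /etc/ceph files before the upgrade, and removing the _admin label that Gene suggested checking for. The function names, the backup location and the host names in the example are made up for illustration. `ceph orch host ls` and `ceph orch host label rm` are the orchestrator commands for listing hosts with their labels and for removing a label. Whether removing the label avoids the loop on a given cluster is an assumption based only on this thread.

```shell
#!/bin/sh
# Hypothetical helper: copy the two client files named in this thread
# from a source directory into a timestamped backup directory.
# Prints the backup directory on success.
backup_ceph_client_files() {
    src_dir="$1"      # normally /etc/ceph
    dest_root="$2"    # hypothetical backup location, e.g. /root/ceph-backup
    dest_dir="$dest_root/$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest_dir" || return 1
    for f in ceph.conf ceph.client.admin.keyring; do
        # Skip files that are already missing instead of failing.
        if [ -f "$src_dir/$f" ]; then
            cp -p "$src_dir/$f" "$dest_dir/$f" || return 1
        fi
    done
    echo "$dest_dir"
}

# Hypothetical helper: remove the _admin label from every host given
# as an argument. Needs a working ceph CLI with admin access.
remove_admin_label() {
    for host in "$@"; do
        ceph orch host label rm "$host" _admin || return 1
    done
}

# Example, commented out so nothing runs by accident:
#   backup_ceph_client_files /etc/ceph /root/ceph-backup
#   ceph orch host ls      # shows each host together with its labels
#   remove_admin_label ceph00 ceph01 ceph02 ceph03
```

The backup function only reads from the source directory, so it should be harmless to run on every node before starting the upgrade. Keep at least one way to reach the cluster with an admin keyring before removing labels.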