Re: Issues upgrading cephadm cluster from Octopus.

Thanks for the suggestions. It took me a little while to try it out, but I was able to get the cluster upgraded from Octopus to the latest Pacific. Setting the migration_current value didn't seem to un-wedge anything, but manually setting the registry_credentials key did.

It appears my mistake was using "ceph config set" rather than "ceph config-key set"; once I corrected that, the upgrade was able to finish without error. Rookie mistake.
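
For the archives, this is roughly what ended up working (with our real registry details in place of the placeholders):

    ceph config-key set mgr/cephadm/registry_credentials \
      '{"url": "XXX", "username": "XXX", "password": "XXX"}'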

thanks again

________________________________
From: Adam King <adking@xxxxxxxxxx>
Sent: Saturday, November 19, 2022 7:21 AM
To: Seth T Graham <sether@xxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: Issues upgrading cephadm cluster from Octopus.

I'll also add, since it could help resolve this: there is no "mgr/cephadm/registry_json" config option. The whole reason for moving from the previous three options to the new JSON object was to move it out of config options, which can get spit out in logs, and into the config-key store, where it's a bit more secure given that we're working with a password. If you want to see what it's currently set to, you would do a "ceph config-key get mgr/cephadm/registry_credentials"; setting it is roughly the same, but swap get for set and provide the JSON string as you had been trying to do before.
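
In other words, something along these lines (the JSON values are placeholders):

    ceph config-key get mgr/cephadm/registry_credentials
    ceph config-key set mgr/cephadm/registry_credentials \
      '{"url": "XXX", "username": "XXX", "password": "XXX"}'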

On Sat, Nov 19, 2022 at 8:05 AM Adam King <adking@xxxxxxxxxx> wrote:
I don't know for sure whether it will fix the issue, but the migrations happen based on a config option, "mgr/cephadm/migration_current". You could try setting that back to 0, which would at least trigger the migrations to run again after restarting/failing over the mgr. They're meant to be idempotent, so in the worst case it just won't accomplish anything. Also, you're correct about it not being in the docs; the migrations were intended to be internal and never require user action, but it appears something has gone wrong in this case.
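
Concretely, that would look something like this (the mgr fail just forces a failover so the module starts fresh and re-runs the migrations):

    ceph config set mgr mgr/cephadm/migration_current 0
    ceph mgr fail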

On Fri, Nov 18, 2022 at 3:06 PM Seth T Graham <sether@xxxxxxxx> wrote:
We have a cluster running Octopus (15.2.17) that I need to get updated, and I am getting cephadm failures when updating the managers; I have tried both Pacific and Quincy with the same results. The cluster was deployed with cephadm on CentOS Stream 8 using podman, and due to network isolation of the cluster the images are being pulled from a private registry. When I issue the 'ceph orch upgrade' command it starts out well by updating two of the three managers. When it gets to the point of transitioning to one of the upgraded managers, the process stops with an error, with 'ceph status' reporting that the cephadm module has failed.
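
(The upgrade was kicked off with something along these lines; the registry hostname and tag here are placeholders standing in for our private registry:)

    ceph orch upgrade start --image registry.example.com/ceph/ceph:v16.2.10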

Digging through the logs, I find a python stack trace that reads:

  File "/usr/share/ceph/mgr/cephadm/module.py", line 587, in serve
    serve.serve()
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 67, in serve
    self.convert_tags_to_repo_digest()
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 974, in convert_tags_to_repo_digest
    self._get_container_image_info(container_image_ref))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 590, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 48, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1374, in _get_container_image_info
    await self._registry_login(host, json.loads(str(self.mgr.get_store('registry_credentials'))))
  File "/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)


Looking through the ceph config, there is indeed no setting for the 'registry_credentials' value. Instead I have the registry_password, registry_url and registry_username values that were set when the cluster was provisioned.
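
(They show up with something like the following; 'ceph config dump' lists everything that has been set:)

    ceph config dump | grep 'mgr/cephadm/registry'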

I do find mention of this key in the migrations.py script (which lives in /usr/share/ceph/mgr/cephadm), under the function 'migrate_4_5', which reads to me like the old keys have been retired in favor of a unified key containing a JSON object. So I attempted to recreate what that function does by setting the key manually, but unfortunately this didn't help.

(eg, 'ceph config set mgr mgr/cephadm/registry_credentials '{ "url": "XXX", "username": "XXX", "password": "XXX" }'')

I'm not sure where to go from here. Is there a 'migrate' option I can specify somewhere to properly upgrade this cluster, and perhaps run the code found in migrations.py? I don't see any mention of this in the documentation, but there's a lot of documentation so it's possible I missed it.

Failing that, are there any suggestions for a workaround so I can get this upgrade completed?

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


