Re: 17.2.2: all MGRs crashing in fresh cephadm install

Carlos Mogas da Silva <r3pek@xxxxxxxxx> · Thu, 28 Jul 2022 16:37:39 +0100



"ceph mgr fail" did clear the dashboard error saying the mgr on ceph01 is in a failed state, but
"ceph orch ps" still shows mgr.ceph01.* as errored and doesn't refresh (15h ago).
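
For anyone who wants to flag this condition from a script, here is a minimal sketch. It assumes the REFRESHED column always looks like "62m ago" or "15h ago", and it takes the 10 minute default refresh interval mentioned further down this thread as given. The helper names, the unit table and the "2 intervals" threshold are made up for illustration; none of this is part of cephadm.

```python
import re

# Assumption: cephadm refreshes daemon info every 10 minutes by default,
# as stated in this thread.
REFRESH_INTERVAL_S = 10 * 60

# Assumption: only these unit suffixes appear in the REFRESHED/AGE columns.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def age_to_seconds(age: str) -> int:
    """Convert an age such as '62m ago' or '15h' to seconds."""
    match = re.fullmatch(r"(\d+)([smhd])(?: ago)?", age.strip())
    if match is None:
        raise ValueError(f"unrecognised age: {age!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def is_stale(refreshed: str, factor: int = 2) -> bool:
    """True if the last refresh is older than `factor` refresh intervals."""
    return age_to_seconds(refreshed) > factor * REFRESH_INTERVAL_S


for value in ("62m ago", "15h ago", "5m ago"):
    print(value, "->", "stale" if is_stale(value) else "ok")
```

With the values seen in this thread, both "62m ago" and "15h ago" come out as stale, which would match the suspicion that cephadm has stopped refreshing.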

On Thu, 2022-07-28 at 11:06 -0400, Adam King wrote:
> I've just taken another look at the orch ps output you posted and noticed that the REFRESHED
> column is reporting "62m ago". That makes it seem like the issue is that cephadm isn't actually
> running its normal operations (it should refresh daemons every 10 minutes by default). I guess
> we should see whether it has logged anything that might tell us where it's stuck, via
> "ceph log last 200 cephadm". To try to get things unstuck, the typical solution is to just run
> "ceph mgr fail", which will make the other mgr active and put the current active one on standby,
> effectively "rebooting" cephadm. If a transient issue was causing cephadm to get stuck, that
> should resolve it. I think (but I'm not certain) that the dashboard might be getting some of its
> daemon info from cephadm, so it being in error there as well might not actually mean much.
> 
> On Thu, Jul 28, 2022 at 10:44 AM Carlos Mogas da Silva <r3pek@xxxxxxxxx> wrote:
> > Yes, cephadm and ceph01 both have mgrs running (the ones with the fix). The "error" is that the
> > ceph01 one is actually running, but from "ceph orch"'s perspective it looks like it's not. Even
> > on the dashboard the daemon shows as errored, but it's running (confirmed via podman and
> > systemctl).
> > My take is that something is not communicating some information to "cephadm", but I don't know
> > what. Ceph itself knows the mgr is running, since it clearly says it's on standby.
> > 
> > 
> > On Wed, 2022-07-27 at 21:09 -0400, Adam King wrote:
> > > What actual hosts are meant to have a mgr here? The naming makes it look as if it thinks
> > > there's a host "ceph01" and a host "cephadm" and both have 1 mgr. Is that actually correct or
> > > is that aspect also messed up?
> > > 
> > > Beyond that, you could try manually placing a copy of the cephadm script on each host and
> > > running "cephadm ls" and see what it gives you. That's how the "ceph orch ps" info is
> > > gathered, so if the output of that looked strange it might tell us something useful.
> > > 
> > > On Wed, Jul 27, 2022 at 8:58 PM Carlos Mogas da Silva <r3pek@xxxxxxxxx> wrote:
> > > > I just built a Ceph cluster and was, unfortunately, hit by this :(
> > > > 
> > > > I managed to restart the mgrs (2 of them) by manually editing
> > > > /var/lib/ceph/<cluster>/mgr.<name>/unit.run.
> > > > 
> > > > But now I have a problem that I really don't understand:
> > > > - both managers are running, and appear on "ceph -s" as
> > > >   "mgr: cephadm.mxrhsp(active, since 62m), standbys: ceph01.fwtity"
> > > > - looks like the orchestrator is a little "confused":
> > > > # ceph orch ps --daemon-type mgr
> > > > NAME                HOST     PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION             IMAGE ID      CONTAINER ID
> > > > mgr.ceph01.fwtity   ceph01   *:8443,9283  error            62m ago   2h        -        -  <unknown>           <unknown>     <unknown>
> > > > mgr.cephadm.mxrhsp  cephadm  *:9283       running (63m)    62m ago   2h     437M        -  17.2.2-1-gf516549e  5081f5a97849  0f0bc2c6791f
> > > > 
> > > > 
> > > > Because of this I can't run a "ceph orch upgrade": it always complains about having only
> > > > one mgr.
> > > > Is there something else that needs to be changed to get the cluster to a normal state?
> > > > 
> > > > Thanks!
> > > > 
> > > > On Wed, 2022-07-27 at 12:23 -0400, Adam King wrote:
> > > > > yeah, that works if there is a working mgr to send the command to. I was
> > > > > assuming here all the mgr daemons were down since it was a fresh cluster so
> > > > > all the mgrs would have this bugged image.
> > > > > 
> > > > > On Wed, Jul 27, 2022 at 12:07 PM Vikhyat Umrao <vikhyat@xxxxxxxxxx> wrote:
> > > > > 
> > > > > > Adam - or we could simply redeploy the daemon with the new image? at least
> > > > > > this is something I did in our testing here[1].
> > > > > > 
> > > > > > > ceph orch daemon redeploy mgr.<name> quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
> > > > > > 
> > > > > > [1] https://github.com/ceph/ceph/pull/47270#issuecomment-1196062363
> > > > > > 
> > > > > > On Wed, Jul 27, 2022 at 8:55 AM Adam King <adking@xxxxxxxxxx> wrote:
> > > > > > 
> > > > > > > > the unit.image file is just there for cephadm to look at as part of
> > > > > > > > gathering metadata, I think. What you'd want to edit is the unit.run file
> > > > > > > > (in the same directory as unit.image). It should have a really long
> > > > > > > > line specifying a podman/docker run command, and somewhere in there will be
> > > > > > > > "CONTAINER_IMAGE=<old-image-name>". You'd need to change that to say
> > > > > > > > "CONTAINER_IMAGE=quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531",
> > > > > > > > then restart the service.
> > > > > > > 
> > > > > > > On Wed, Jul 27, 2022 at 11:46 AM Daniel Schreiber <
> > > > > > > daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > > > 
> > > > > > > > Hi Neha,
> > > > > > > > 
> > > > > > > > thanks for the quick response. Sorry for that stupid question: to use
> > > > > > > > that image I pull the image on the machine and then change
> > > > > > > > /var/lib/ceph/${clusterid}/mgr.${unit}/unit.image and start the service?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Daniel
> > > > > > > > 
> > > > > > > > Am 27.07.22 um 17:23 schrieb Neha Ojha:
> > > > > > > > > Hi Daniel,
> > > > > > > > > 
> > > > > > > > > > This issue seems to be showing up in 17.2.2, details in
> > > > > > > > > > https://tracker.ceph.com/issues/55304. We are currently in the process
> > > > > > > > > > of validating the fix https://github.com/ceph/ceph/pull/47270 and
> > > > > > > > > > we'll try to expedite a quick fix.
> > > > > > > > > 
> > > > > > > > > In the meantime, we have builds/images of the dev version of the fix,
> > > > > > > > > in case you want to give it a try.
> > > > > > > > > https://shaman.ceph.com/builds/ceph/wip-quincy-libcephsqlite-fix/
> > > > > > > > > quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > > Neha
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > On Wed, Jul 27, 2022 at 8:10 AM Daniel Schreiber
> > > > > > > > > <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > > > 
> > > > > > > > > > Hi,
> > > > > > > > > > 
> > > > > > > > > > I installed a fresh cluster using cephadm:
> > > > > > > > > > 
> > > > > > > > > > - bootstrapped one node
> > > > > > > > > > > - extended it to 3 monitor nodes, each running mon + mgr, using a spec file
> > > > > > > > > > > - added 12 OSD hosts to the spec file with the following disk rules:
> > > > > > > > > > 
> > > > > > > > > > ~~~
> > > > > > > > > > service_type: osd
> > > > > > > > > > service_id: osd_spec_hdd
> > > > > > > > > > placement:
> > > > > > > > > >     label: osd
> > > > > > > > > > spec:
> > > > > > > > > >     data_devices:
> > > > > > > > > >       model: "HGST HUH721212AL" # HDDs
> > > > > > > > > >     db_devices:
> > > > > > > > > >       model: "SAMSUNG MZ7KH1T9" # SATA SSDs
> > > > > > > > > > 
> > > > > > > > > > ---
> > > > > > > > > > 
> > > > > > > > > > service_type: osd
> > > > > > > > > > service_id: osd_spec_nvme
> > > > > > > > > > placement:
> > > > > > > > > >     label: osd
> > > > > > > > > > spec:
> > > > > > > > > >     data_devices:
> > > > > > > > > >       model: "SAMSUNG MZPLL1T6HAJQ-00005" # NVMEs
> > > > > > > > > > ~~~
> > > > > > > > > > 
> > > > > > > > > > OSDs on HDD + SSD were deployed, NVME OSDs were not.
> > > > > > > > > > 
> > > > > > > > > > MGRs crashed, one after the other:
> > > > > > > > > > 
> > > > > > > > > > > debug    -65> 2022-07-25T17:06:36.507+0000 7f4a33f80700  5 cephsqlite: FullPathname: (client.17139) 1: /.mgr:devicehealth/main.db
> > > > > > > > > > > debug    -64> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO sso] Loading SSO DB version=1
> > > > > > > > > > > debug    -63> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr get_store get_store key: mgr/dashboard/ssodb_v1
> > > > > > > > > > > debug    -62> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 ceph_store_get ssodb_v1 not found
> > > > > > > > > > > debug    -61> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] server: ssl=no host=:: port=8080
> > > > > > > > > > > debug    -60> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] Configured CherryPy, starting engine...
> > > > > > > > > > > debug    -59> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr set_uri module dashboard set URI 'http://192.168.14.201:8080/'
> > > > > > > > > > > debug    -58> 2022-07-25T17:06:36.511+0000 7f4a64e91700  4 ceph_store_get active_devices not found
> > > > > > > > > > > debug    -57> 2022-07-25T17:06:36.511+0000 7f4a33f80700 -1 *** Caught signal (Aborted) **
> > > > > > > > > > >    in thread 7f4a33f80700 thread_name:devicehealth
> > > > > > > > > > >    ceph version 17.2.2 (b6e46b8939c67a6cc754abb4d0ece3c8918eccc3) quincy (stable)
> > > > > > > > > > >    1: /lib64/libpthread.so.0(+0x12ce0) [0x7f4a9b0d0ce0]
> > > > > > > > > > >    2: gsignal()
> > > > > > > > > > >    3: abort()
> > > > > > > > > > >    4: /lib64/libstdc++.so.6(+0x9009b) [0x7f4a9a4cf09b]
> > > > > > > > > > >    5: /lib64/libstdc++.so.6(+0x9653c) [0x7f4a9a4d553c]
> > > > > > > > > > >    6: /lib64/libstdc++.so.6(+0x96597) [0x7f4a9a4d5597]
> > > > > > > > > > >    7: /lib64/libstdc++.so.6(+0x967f8) [0x7f4a9a4d57f8]
> > > > > > > > > > >    8: (std::__throw_regex_error(std::regex_constants::error_type, char const*)+0x4a) [0x5607b31d5eea]
> > > > > > > > > > >    9: (bool std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_expression_term<false, false>(std::__detail::_Compiler<std::__cxx11::regex>
> > > > > > > > > > >    10: (void std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_bracket_matcher<false, false>(bool)+0x146) [0x5607b31e26b6]
> > > > > > > > > > >    11: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_bracket_expression()+0x6b) [0x5607b31e663b]
> > > > > > > > > > >    12: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x6a) [0x5607b31e671a]
> > > > > > > > > > >    13: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
> > > > > > > > > > >    14: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
> > > > > > > > > > >    15: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x338) [0x5607b31e69e8]
> > > > > > > > > > >    16: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
> > > > > > > > > > >    17: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > > > >    18: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > > > >    19: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > > > >    20: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > > > >    21: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
> > > > > > > > > > >    22: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syn>
> > > > > > > > > > >    23: /lib64/libcephsqlite.so(+0x1b7ca) [0x7f4a9d8ba7ca]
> > > > > > > > > > >    24: /lib64/libcephsqlite.so(+0x24486) [0x7f4a9d8c3486]
> > > > > > > > > > >    25: /lib64/libsqlite3.so.0(+0x75f1c) [0x7f4a9d600f1c]
> > > > > > > > > > >    26: /lib64/libsqlite3.so.0(+0xdd4c9) [0x7f4a9d6684c9]
> > > > > > > > > > >    27: pysqlite_connection_init()
> > > > > > > > > > >    28: /lib64/libpython3.6m.so.1.0(+0x13afc6) [0x7f4a9d182fc6]
> > > > > > > > > > >    29: PyObject_Call()
> > > > > > > > > > >    30: /lib64/python3.6/lib-dynload/_sqlite3.cpython-36m-x86_64-linux-gnu.so(+0xa1f5) [0x7f4a8bdf31f5]
> > > > > > > > > > >    31: /lib64/libpython3.6m.so.1.0(+0x19d5f1) [0x7f4a9d1e55f1]
> > > > > > > > > > >    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > > > 
> > > > > > > > > > > Is there anything I can do to recover from this? Is there anything I can
> > > > > > > > > > > add to help debug this?
> > > > > > > > > > 
> > > > > > > > > > Thank you,
> > > > > > > > > > 
> > > > > > > > > > Daniel
> > > > > > > > > > --
> > > > > > > > > > Daniel Schreiber
> > > > > > > > > > Facharbeitsgruppe Systemsoftware
> > > > > > > > > > Universitaetsrechenzentrum
> > > > > > > > > > 
> > > > > > > > > > Technische Universität Chemnitz
> > > > > > > > > > Straße der Nationen 62 (Raum B303)
> > > > > > > > > > 09111 Chemnitz
> > > > > > > > > > Germany
> > > > > > > > > > 
> > > > > > > > > > Tel:     +49 371 531 35444
> > > > > > > > > > Fax:     +49 371 531 835444
> > > > > > > > > > _______________________________________________
> > > > > > > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > > > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > --
> > > > > > > > Daniel Schreiber
> > > > > > > > Facharbeitsgruppe Systemsoftware
> > > > > > > > Universitaetsrechenzentrum
> > > > > > > > 
> > > > > > > > Technische Universität Chemnitz
> > > > > > > > Straße der Nationen 62 (Raum B303)
> > > > > > > > 09111 Chemnitz
> > > > > > > > Germany
> > > > > > > > 
> > > > > > > > Tel:     +49 371 531 35444
> > > > > > > > Fax:     +49 371 531 835444
> > > > > > > > _______________________________________________
> > > > > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > > > > > 
> > > > > > 
> > > > > _______________________________________________
> > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > > 
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > 
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx