Re: 17.2.2: all MGRs crashing in fresh cephadm install

I've just taken another look at the orch ps output you posted and noticed
that the REFRESHED column is reporting "62m ago". That suggests the real
issue is that cephadm isn't actually running its normal background
operations (it should refresh daemon info every 10 minutes by default). It's
worth checking whether it has logged anything that shows where it's stuck:
"ceph log last 200 cephadm". To get things unstuck, the typical fix is to
run "ceph mgr fail", which makes the standby mgr active and puts the current
active mgr on standby, effectively "rebooting" cephadm. If a transient issue
caused cephadm to get stuck, that should resolve it. I think (but I'm not
certain) that the dashboard gets some of its daemon info from cephadm, so
the daemon showing as errored there as well may not actually mean much.
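
For reference, those commands would be roughly the following (a sketch, run
from a node with an admin keyring; adjust as needed):

~~~
# see what cephadm has logged recently; look for tracebacks or a repeated error
ceph log last 200 cephadm

# fail over to the standby mgr; the new active mgr starts a fresh cephadm instance
ceph mgr fail
~~~

If cephadm was only transiently stuck, the REFRESHED times in "ceph orch ps"
should drop back under 10 minutes shortly after the failover.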

On Thu, Jul 28, 2022 at 10:44 AM Carlos Mogas da Silva <r3pek@xxxxxxxxx>
wrote:

> Yes, cephadm and ceph01 both have mgrs running (the ones with the fix).
> The "error" is that the ceph01 one is actually running, but from "ceph
> orch"'s perspective it looks like it's not. Even on the dashboard the
> daemon shows as errored, though it is running (confirmed via podman and
> systemctl). My take is that something is not communicating some
> information to "cephadm", but I don't know what. Ceph itself knows the
> mgr is running, since it clearly reports it as a standby.
>
>
> On Wed, 2022-07-27 at 21:09 -0400, Adam King wrote:
> > What actual hosts are meant to have a mgr here? The naming makes it look
> > as if it thinks there's a host "ceph01" and a host "cephadm" and both
> > have 1 mgr. Is that actually correct, or is that aspect also messed up?
> >
> > Beyond that, you could try manually placing a copy of the cephadm script
> > on each host and running "cephadm ls" and see what it gives you. That's
> > how the "ceph orch ps" info is gathered, so if the output of that looked
> > strange it might tell us something useful.
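> >
> > For example, something like this on each host (a rough sketch; the
> > download URL follows the upstream docs and may need adjusting for your
> > release):
> >
> > curl --silent --remote-name --location https://github.com/ceph/ceph/raw/quincy/src/cephadm/cephadm
> > chmod +x cephadm
> > sudo ./cephadm ls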
> >
> > On Wed, Jul 27, 2022 at 8:58 PM Carlos Mogas da Silva <r3pek@xxxxxxxxx>
> wrote:
> > > I just built a Ceph cluster and was, unfortunately, hit by this :(
> > >
> > > I managed to restart the mgrs (2 of them) by manually editing the
> > > /var/lib/ceph/<cluster>/mgr.<name>/unit.run file.
> > >
> > > But now I have a problem that I really don't understand:
> > > - both managers are running and appear in "ceph -s" as "mgr:
> > >   cephadm.mxrhsp(active, since 62m), standbys: ceph01.fwtity"
> > > - it looks like the orchestrator is a little "confused":
> > > # ceph orch ps --daemon-type mgr
> > > NAME                HOST     PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION             IMAGE ID      CONTAINER ID
> > > mgr.ceph01.fwtity   ceph01   *:8443,9283  error          62m ago    2h   -        -        <unknown>           <unknown>     <unknown>
> > > mgr.cephadm.mxrhsp  cephadm  *:9283       running (63m)  62m ago    2h   437M     -        17.2.2-1-gf516549e  5081f5a97849  0f0bc2c6791f
> > >
> > > Because of this I can't run "ceph orch upgrade", because it always
> > > complains about having only one mgr. Is there something else that
> > > needs to be changed to get the cluster back to a normal state?
> > >
> > > Thanks!
> > >
> > > On Wed, 2022-07-27 at 12:23 -0400, Adam King wrote:
> > > > Yeah, that works if there is a working mgr to send the command to. I was
> > > > assuming here that all the mgr daemons were down, since it was a fresh
> > > > cluster, so all the mgrs would have this bugged image.
> > > >
> > > > On Wed, Jul 27, 2022 at 12:07 PM Vikhyat Umrao <vikhyat@xxxxxxxxxx>
> wrote:
> > > >
> > > > > Adam - or we could simply redeploy the daemon with the new image? At
> > > > > least this is something I did in our testing here [1].
> > > > >
> > > > > ceph orch daemon redeploy mgr.<name> quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
> > > > >
> > > > > [1] https://github.com/ceph/ceph/pull/47270#issuecomment-1196062363
> > > > >
> > > > > On Wed, Jul 27, 2022 at 8:55 AM Adam King <adking@xxxxxxxxxx>
> wrote:
> > > > >
> > > > > > The unit.image file is just there for cephadm to look at as part of
> > > > > > gathering metadata, I think. What you'd want to edit is the unit.run
> > > > > > file (in the same directory as the unit.image). It should have a really
> > > > > > long line specifying a podman/docker run command, and somewhere in there
> > > > > > will be "CONTAINER_IMAGE=<old-image-name>". You'd need to change that to
> > > > > > say "CONTAINER_IMAGE=quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531"
> > > > > > and then restart the service.
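> > > > > >
> > > > > > As a rough sketch (the fsid and mgr name are placeholders; check the
> > > > > > exact paths and unit names on the host first):
> > > > > >
> > > > > > sed -i 's#CONTAINER_IMAGE=[^ ]*#CONTAINER_IMAGE=quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531#' /var/lib/ceph/<fsid>/mgr.<name>/unit.run
> > > > > > systemctl restart ceph-<fsid>@mgr.<name>.service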
> > > > > >
> > > > > > On Wed, Jul 27, 2022 at 11:46 AM Daniel Schreiber <
> > > > > > daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > >
> > > > > > > Hi Neha,
> > > > > > >
> > > > > > > thanks for the quick response. Sorry for the stupid question: to use
> > > > > > > that image, do I pull the image on the machine and then change
> > > > > > > /var/lib/ceph/${clusterid}/mgr.${unit}/unit.image and start the
> > > > > > > service?
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Daniel
> > > > > > >
> > > > > > > On 27.07.2022 at 17:23, Neha Ojha wrote:
> > > > > > > > Hi Daniel,
> > > > > > > >
> > > > > > > > This issue seems to be showing up in 17.2.2; details in
> > > > > > > > https://tracker.ceph.com/issues/55304. We are currently in the
> > > > > > > > process of validating the fix,
> > > > > > > > https://github.com/ceph/ceph/pull/47270, and we'll try to
> > > > > > > > expedite a quick fix.
> > > > > > > >
> > > > > > > > In the meantime, we have builds/images of the dev version of the
> > > > > > > > fix, in case you want to give it a try:
> > > > > > > > https://shaman.ceph.com/builds/ceph/wip-quincy-libcephsqlite-fix/
> > > > > > > > quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Neha
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Jul 27, 2022 at 8:10 AM Daniel Schreiber
> > > > > > > > <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > > > > >
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > > I installed a fresh cluster using cephadm:
> > > > > > > > >
> > > > > > > > > - bootstrapped one node
> > > > > > > > > - extended it to 3 monitor nodes, each running mon + mgr, using a
> > > > > > > > >   spec file
> > > > > > > > > - added 12 OSD hosts to the spec file with the following disk rules:
> > > > > > > > >
> > > > > > > > > ~~~
> > > > > > > > > service_type: osd
> > > > > > > > > service_id: osd_spec_hdd
> > > > > > > > > placement:
> > > > > > > > >     label: osd
> > > > > > > > > spec:
> > > > > > > > >     data_devices:
> > > > > > > > >       model: "HGST HUH721212AL" # HDDs
> > > > > > > > >     db_devices:
> > > > > > > > >       model: "SAMSUNG MZ7KH1T9" # SATA SSDs
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > service_type: osd
> > > > > > > > > service_id: osd_spec_nvme
> > > > > > > > > placement:
> > > > > > > > >     label: osd
> > > > > > > > > spec:
> > > > > > > > >     data_devices:
> > > > > > > > >       model: "SAMSUNG MZPLL1T6HAJQ-00005" # NVMEs
> > > > > > > > > ~~~
> > > > > > > > >
> > > > > > > > > OSDs on HDD + SSD were deployed, NVME OSDs were not.
> > > > > > > > >
> > > > > > > > > MGRs crashed, one after the other:
> > > > > > > > >
> > > > > > > > > debug    -65> 2022-07-25T17:06:36.507+0000 7f4a33f80700  5 cephsqlite: FullPathname: (client.17139) 1: /.mgr:devicehealth/main.db
> > > > > > > > > debug    -64> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO sso] Loading SSO DB version=1
> > > > > > > > > debug    -63> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr get_store get_store key: mgr/dashboard/ssodb_v1
> > > > > > > > > debug    -62> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 ceph_store_get ssodb_v1 not found
> > > > > > > > > debug    -61> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] server: ssl=no host=:: port=8080
> > > > > > > > > debug    -60> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] Configured CherryPy, starting engine...
> > > > > > > > > debug    -59> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr set_uri module dashboard set URI 'http://192.168.14.201:8080/'
> > > > > > > > > debug    -58> 2022-07-25T17:06:36.511+0000 7f4a64e91700  4 ceph_store_get active_devices not found
> > > > > > > > > debug    -57> 2022-07-25T17:06:36.511+0000 7f4a33f80700 -1 *** Caught signal (Aborted) **
> > > > > > > > >    in thread 7f4a33f80700 thread_name:devicehealth
> > > > > > > > >    ceph version 17.2.2 (b6e46b8939c67a6cc754abb4d0ece3c8918eccc3) quincy (stable)
> > > > > > > > >    1: /lib64/libpthread.so.0(+0x12ce0) [0x7f4a9b0d0ce0]
> > > > > > > > >    2: gsignal()
> > > > > > > > >    3: abort()
> > > > > > > > >    4: /lib64/libstdc++.so.6(+0x9009b) [0x7f4a9a4cf09b]
> > > > > > > > >    5: /lib64/libstdc++.so.6(+0x9653c) [0x7f4a9a4d553c]
> > > > > > > > >    6: /lib64/libstdc++.so.6(+0x96597) [0x7f4a9a4d5597]
> > > > > > > > >    7: /lib64/libstdc++.so.6(+0x967f8) [0x7f4a9a4d57f8]
> > > > > > > > >    8: (std::__throw_regex_error(std::regex_constants::error_type, char const*)+0x4a) [0x5607b31d5eea]
> > > > > > > > >    9: (bool std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_expression_term<false, false>(std::__detail::_Compiler<std::__cxx11::regex>
> > > > > > > > >    10: (void std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_bracket_matcher<false, false>(bool)+0x146) [0x5607b31e26b6]
> > > > > > > > >    11: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_bracket_expression()+0x6b) [0x5607b31e663b]
> > > > > > > > >    12: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x6a) [0x5607b31e671a]
> > > > > > > > >    13: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
> > > > > > > > >    14: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
> > > > > > > > >    15: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x338) [0x5607b31e69e8]
> > > > > > > > >    16: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
> > > > > > > > >    17: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > >    18: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > >    19: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > >    20: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > > > >    21: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
> > > > > > > > >    22: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syn>
> > > > > > > > >    23: /lib64/libcephsqlite.so(+0x1b7ca) [0x7f4a9d8ba7ca]
> > > > > > > > >    24: /lib64/libcephsqlite.so(+0x24486) [0x7f4a9d8c3486]
> > > > > > > > >    25: /lib64/libsqlite3.so.0(+0x75f1c) [0x7f4a9d600f1c]
> > > > > > > > >    26: /lib64/libsqlite3.so.0(+0xdd4c9) [0x7f4a9d6684c9]
> > > > > > > > >    27: pysqlite_connection_init()
> > > > > > > > >    28: /lib64/libpython3.6m.so.1.0(+0x13afc6) [0x7f4a9d182fc6]
> > > > > > > > >    29: PyObject_Call()
> > > > > > > > >    30: /lib64/python3.6/lib-dynload/_sqlite3.cpython-36m-x86_64-linux-gnu.so(+0xa1f5) [0x7f4a8bdf31f5]
> > > > > > > > >    31: /lib64/libpython3.6m.so.1.0(+0x19d5f1) [0x7f4a9d1e55f1]
> > > > > > > > >    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > > > >
> > > > > > > > > Is there anything I can do to recover from this? Is there
> > > > > > > > > anything I can provide to help debug this?
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > Daniel
> > > > > > > > > --
> > > > > > > > > Daniel Schreiber
> > > > > > > > > Facharbeitsgruppe Systemsoftware
> > > > > > > > > Universitaetsrechenzentrum
> > > > > > > > >
> > > > > > > > > Technische Universität Chemnitz
> > > > > > > > > Straße der Nationen 62 (Raum B303)
> > > > > > > > > 09111 Chemnitz
> > > > > > > > > Germany
> > > > > > > > >
> > > > > > > > > Tel:     +49 371 531 35444
> > > > > > > > > Fax:     +49 371 531 835444
> > > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Daniel Schreiber
> > > > > > > Facharbeitsgruppe Systemsoftware
> > > > > > > Universitaetsrechenzentrum
> > > > > > >
> > > > > > > Technische Universität Chemnitz
> > > > > > > Straße der Nationen 62 (Raum B303)
> > > > > > > 09111 Chemnitz
> > > > > > > Germany
> > > > > > >
> > > > > > > Tel:     +49 371 531 35444
> > > > > > > Fax:     +49 371 531 835444
> > > > > >
> > > > >
> > >
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
