Re: 17.2.2: all MGRs crashing in fresh cephadm install

What actual hosts are meant to have a mgr here? The naming makes it look as
if it thinks there's a host "ceph01" and a host "cephadm" and both have 1
mgr. Is that actually correct or is that aspect also messed up?

Beyond that, you could try manually placing a copy of the cephadm script on
each host and running "cephadm ls" to see what it gives you. That's how the
"ceph orch ps" info is gathered, so if that output looks strange it might tell
us something useful.
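
For example, a rough sketch (assuming the quincy branch copy of the script and
root on each host; adjust to match your install):

# curl --silent --remote-name --location https://github.com/ceph/ceph/raw/quincy/src/cephadm/cephadm
# chmod +x cephadm
# ./cephadm ls

"cephadm ls" prints, in JSON, the daemons cephadm finds on that host along with
their container/systemd state, which is the data the orchestrator caches.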

On Wed, Jul 27, 2022 at 8:58 PM Carlos Mogas da Silva <r3pek@xxxxxxxxx>
wrote:

> I just built a Ceph cluster and was, unfortunately, hit by this :(
>
> I managed to restart the mgrs (2 of them) by manually editing the
> /var/run/ceph/<cluster>/mgr.<name>/unit.run.
>
> But now I have a problem that I really don't understand:
> - both managers are running, and appear on "ceph -s" as "mgr:
> cephadm.mxrhsp(active, since 62m),
> standbys: ceph01.fwtity"
> - looks like the orchestrator is a little "confused":
> # ceph orch ps --daemon-type mgr
> NAME                HOST     PORTS        STATUS         REFRESHED  AGE  MEM USE  MEM LIM  VERSION             IMAGE ID      CONTAINER ID
> mgr.ceph01.fwtity   ceph01   *:8443,9283  error          62m ago    2h   -        -        <unknown>           <unknown>     <unknown>
> mgr.cephadm.mxrhsp  cephadm  *:9283       running (63m)  62m ago    2h   437M     -        17.2.2-1-gf516549e  5081f5a97849  0f0bc2c6791f
>
> Because of this I can't run "ceph orch upgrade", because it always
> complains about having only one mgr.
> Is there something else that needs to be changed to get the cluster to a
> normal state?
>
> Thanks!
>
> On Wed, 2022-07-27 at 12:23 -0400, Adam King wrote:
> > Yeah, that works if there is a working mgr to send the command to. I was
> > assuming here all the mgr daemons were down since it was a fresh cluster,
> > so all the mgrs would have this bugged image.
> >
> > On Wed, Jul 27, 2022 at 12:07 PM Vikhyat Umrao <vikhyat@xxxxxxxxxx> wrote:
> >
> > > Adam - or we could simply redeploy the daemon with the new image? At
> > > least this is something I did in our testing here [1].
> > >
> > > ceph orch daemon redeploy mgr.<name>
> > > quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
> > >
> > > [1] https://github.com/ceph/ceph/pull/47270#issuecomment-1196062363
> > >
> > > On Wed, Jul 27, 2022 at 8:55 AM Adam King <adking@xxxxxxxxxx> wrote:
> > >
> > > > the unit.image file is just there for cephadm to look at as part of
> > > > gathering metadata, I think. What you'd want to edit is the unit.run
> > > > file (in the same directory as the unit.image). It should have a really
> > > > long line specifying a podman/docker run command, and somewhere in there
> > > > will be "CONTAINER_IMAGE=<old-image-name>". You'd need to change that to
> > > > say "CONTAINER_IMAGE=quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531"
> > > > and then restart the service.
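> > > >
> > > > For example, a rough sketch (assuming the old image was the stock
> > > > quay.io/ceph/ceph:v17.2.2 tag and the usual cephadm unit naming;
> > > > substitute your cluster's fsid and the mgr daemon name):
> > > >
> > > > # cd /var/lib/ceph/<fsid>/mgr.<name>
> > > > # sed -i 's|quay.io/ceph/ceph:v17.2.2|quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531|g' unit.run
> > > > # systemctl restart ceph-<fsid>@mgr.<name>.service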
> > > >
> > > > On Wed, Jul 27, 2022 at 11:46 AM Daniel Schreiber
> > > > <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > > Hi Neha,
> > > > >
> > > > > thanks for the quick response. Sorry for that stupid question: to
> > > > > use that image, do I pull the image on the machine, then change
> > > > > /var/lib/ceph/${clusterid}/mgr.${unit}/unit.image and start the
> > > > > service?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Daniel
> > > > >
> > > > > On 27.07.22 at 17:23, Neha Ojha wrote:
> > > > > > Hi Daniel,
> > > > > >
> > > > > > This issue seems to be showing up in 17.2.2, details in
> > > > > > https://tracker.ceph.com/issues/55304. We are currently in the
> > > > > > process of validating the fix https://github.com/ceph/ceph/pull/47270
> > > > > > and we'll try to expedite a quick fix.
> > > > > >
> > > > > > In the meantime, we have builds/images of the dev version of the
> > > > > > fix, in case you want to give it a try.
> > > > > >
> > > > > > https://shaman.ceph.com/builds/ceph/wip-quincy-libcephsqlite-fix/
> > > > > >
> > > > > > quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
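> > > > > >
> > > > > > For example (a sketch, assuming podman as the container runtime),
> > > > > > pulling it on each mgr host would be:
> > > > > >
> > > > > > # podman pull quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531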
> > > > > >
> > > > > > Thanks,
> > > > > > Neha
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Wed, Jul 27, 2022 at 8:10 AM Daniel Schreiber
> > > > > > <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I installed a fresh cluster using cephadm:
> > > > > > >
> > > > > > > - bootstrapped one node
> > > > > > > - extended it to 3 monitor nodes, each running mon + mgr, using a
> > > > > > > spec file
> > > > > > > - added 12 OSD hosts to the spec file with the following disk rules:
> > > > > > >
> > > > > > > ~~~
> > > > > > > service_type: osd
> > > > > > > service_id: osd_spec_hdd
> > > > > > > placement:
> > > > > > >     label: osd
> > > > > > > spec:
> > > > > > >     data_devices:
> > > > > > >       model: "HGST HUH721212AL" # HDDs
> > > > > > >     db_devices:
> > > > > > >       model: "SAMSUNG MZ7KH1T9" # SATA SSDs
> > > > > > >
> > > > > > > ---
> > > > > > >
> > > > > > > service_type: osd
> > > > > > > service_id: osd_spec_nvme
> > > > > > > placement:
> > > > > > >     label: osd
> > > > > > > spec:
> > > > > > >     data_devices:
> > > > > > >       model: "SAMSUNG MZPLL1T6HAJQ-00005" # NVMEs
> > > > > > > ~~~
> > > > > > >
> > > > > > > OSDs on HDD + SSD were deployed, NVME OSDs were not.
> > > > > > >
> > > > > > > MGRs crashed, one after the other:
> > > > > > >
> > > > > > > debug    -65> 2022-07-25T17:06:36.507+0000 7f4a33f80700  5 cephsqlite: FullPathname: (client.17139) 1: /.mgr:devicehealth/main.db
> > > > > > > debug    -64> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO sso] Loading SSO DB version=1
> > > > > > > debug    -63> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr get_store get_store key: mgr/dashboard/ssodb_v1
> > > > > > > debug    -62> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 ceph_store_get ssodb_v1 not found
> > > > > > > debug    -61> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] server: ssl=no host=:: port=8080
> > > > > > > debug    -60> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] Configured CherryPy, starting engine...
> > > > > > > debug    -59> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr set_uri module dashboard set URI 'http://192.168.14.201:8080/'
> > > > > > > debug    -58> 2022-07-25T17:06:36.511+0000 7f4a64e91700  4 ceph_store_get active_devices not found
> > > > > > > debug    -57> 2022-07-25T17:06:36.511+0000 7f4a33f80700 -1 *** Caught signal (Aborted) **
> > > > > > >    in thread 7f4a33f80700 thread_name:devicehealth
> > > > > > >    ceph version 17.2.2 (b6e46b8939c67a6cc754abb4d0ece3c8918eccc3) quincy (stable)
> > > > > > >    1: /lib64/libpthread.so.0(+0x12ce0) [0x7f4a9b0d0ce0]
> > > > > > >    2: gsignal()
> > > > > > >    3: abort()
> > > > > > >    4: /lib64/libstdc++.so.6(+0x9009b) [0x7f4a9a4cf09b]
> > > > > > >    5: /lib64/libstdc++.so.6(+0x9653c) [0x7f4a9a4d553c]
> > > > > > >    6: /lib64/libstdc++.so.6(+0x96597) [0x7f4a9a4d5597]
> > > > > > >    7: /lib64/libstdc++.so.6(+0x967f8) [0x7f4a9a4d57f8]
> > > > > > >    8: (std::__throw_regex_error(std::regex_constants::error_type, char const*)+0x4a) [0x5607b31d5eea]
> > > > > > >    9: (bool std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_expression_term<false, false>(std::__detail::_Compiler<std::__cxx11::regex>
> > > > > > >    10: (void std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_bracket_matcher<false, false>(bool)+0x146) [0x5607b31e26b6]
> > > > > > >    11: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_bracket_expression()+0x6b) [0x5607b31e663b]
> > > > > > >    12: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x6a) [0x5607b31e671a]
> > > > > > >    13: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
> > > > > > >    14: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
> > > > > > >    15: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x338) [0x5607b31e69e8]
> > > > > > >    16: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
> > > > > > >    17: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > >    18: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > >    19: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > >    20: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
> > > > > > >    21: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
> > > > > > >    22: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syn>
> > > > > > >    23: /lib64/libcephsqlite.so(+0x1b7ca) [0x7f4a9d8ba7ca]
> > > > > > >    24: /lib64/libcephsqlite.so(+0x24486) [0x7f4a9d8c3486]
> > > > > > >    25: /lib64/libsqlite3.so.0(+0x75f1c) [0x7f4a9d600f1c]
> > > > > > >    26: /lib64/libsqlite3.so.0(+0xdd4c9) [0x7f4a9d6684c9]
> > > > > > >    27: pysqlite_connection_init()
> > > > > > >    28: /lib64/libpython3.6m.so.1.0(+0x13afc6) [0x7f4a9d182fc6]
> > > > > > >    29: PyObject_Call()
> > > > > > >    30: /lib64/python3.6/lib-dynload/_sqlite3.cpython-36m-x86_64-linux-gnu.so(+0xa1f5) [0x7f4a8bdf31f5]
> > > > > > >    31: /lib64/libpython3.6m.so.1.0(+0x19d5f1) [0x7f4a9d1e55f1]
> > > > > > >    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > > > >
> > > > > > > Is there anything I can do to recover from this? Is there
> > > > > > > anything I can add to help debug this?
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > Daniel
> > > > > > > --
> > > > > > > Daniel Schreiber
> > > > > > > Facharbeitsgruppe Systemsoftware
> > > > > > > Universitaetsrechenzentrum
> > > > > > >
> > > > > > > Technische Universität Chemnitz
> > > > > > > Straße der Nationen 62 (Raum B303)
> > > > > > > 09111 Chemnitz
> > > > > > > Germany
> > > > > > >
> > > > > > > Tel:     +49 371 531 35444
> > > > > > > Fax:     +49 371 531 835444
> > > > > > > _______________________________________________
> > > > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > > > >
> > > > >
> > > > > --
> > > > > Daniel Schreiber
> > > > > Facharbeitsgruppe Systemsoftware
> > > > > Universitaetsrechenzentrum
> > > > >
> > > > > Technische Universität Chemnitz
> > > > > Straße der Nationen 62 (Raum B303)
> > > > > 09111 Chemnitz
> > > > > Germany
> > > > >
> > > > > Tel:     +49 371 531 35444
> > > > > Fax:     +49 371 531 835444
> > > > > _______________________________________________
> > > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > > >
> > > > _______________________________________________
> > > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > >
> > >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



