Re: 17.2.2: all MGRs crashing in fresh cephadm install

Hi,

Thanks, that worked. I deployed the first MGR manually and the others using the orchestrator.
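
For anyone else hitting this, the rough recipe (daemon names are placeholders; the image is the test build Neha posted below): bring one mgr up by hand, e.g. via the unit.run edit Adam describes below, then redeploy the remaining mgrs through the orchestrator:

~~~
ceph orch daemon redeploy mgr.<host2> \
    quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
ceph orch daemon redeploy mgr.<host3> \
    quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
~~~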

Thank you so much.

Daniel

On 27.07.22 at 18:23, Adam King wrote:
Yeah, that works if there is a working mgr to send the command to. I was assuming all the mgr daemons here were down, since it was a fresh cluster, so all the mgrs would have the buggy image.

On Wed, Jul 27, 2022 at 12:07 PM Vikhyat Umrao <vikhyat@xxxxxxxxxx> wrote:

    Adam - or we could simply redeploy the daemon with the new image? At
    least, this is what I did in our testing here [1].

    |ceph orch daemon redeploy mgr.<name> quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531|

    [1] https://github.com/ceph/ceph/pull/47270#issuecomment-1196062363

    On Wed, Jul 27, 2022 at 8:55 AM Adam King <adking@xxxxxxxxxx> wrote:

        The unit.image file is just there for cephadm to look at as part of
        gathering metadata, I think. What you'd want to edit is the unit.run
        file (in the same directory as unit.image). It should have a really
        long line specifying a podman/docker run command, and somewhere in
        there will be "CONTAINER_IMAGE=<old-image-name>". You'd need to change
        that to say
        "CONTAINER_IMAGE=quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531",
        then restart the service.
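
        A rough sketch of that edit, assuming a standard cephadm layout (the
        fsid, daemon name and the ceph-<fsid>@<daemon>.service unit name are
        placeholders; the old image string usually also appears as the image
        argument of the podman/docker run line, so replacing every occurrence
        is safest):

        ~~~
        FSID=<clusterid>
        NAME=mgr.<host>
        NEW=quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
        OLD=<old-image-name>
        # swap the image everywhere in unit.run, then restart the daemon's systemd unit
        sed -i "s#${OLD}#${NEW}#g" /var/lib/ceph/${FSID}/${NAME}/unit.run
        systemctl restart ceph-${FSID}@${NAME}.service
        ~~~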

        On Wed, Jul 27, 2022 at 11:46 AM Daniel Schreiber
        <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:

         > Hi Neha,
         >
         > thanks for the quick response. Sorry for the stupid question: to use
         > that image, do I pull it on the machine, then change
         > /var/lib/ceph/${clusterid}/mgr.${unit}/unit.image and start the service?
         >
         > Thanks,
         >
         > Daniel
         >
         > On 27.07.22 at 17:23, Neha Ojha wrote:
         > > Hi Daniel,
         > >
         > > This issue seems to be showing up in 17.2.2, details in
         > > https://tracker.ceph.com/issues/55304. We are currently in the
         > > process of validating the fix https://github.com/ceph/ceph/pull/47270
         > > and we'll try to expedite a quick fix.
         > >
         > > In the meantime, we have builds/images of the dev version of the fix,
         > > in case you want to give it a try.
         > >
         > > https://shaman.ceph.com/builds/ceph/wip-quincy-libcephsqlite-fix/
         > >
         > > quay.ceph.io/ceph-ci/ceph:f516549e3e4815795ff0343ab71b3ebf567e5531
         > >
         > > Thanks,
         > > Neha
         > >
         > >
         > >
         > > On Wed, Jul 27, 2022 at 8:10 AM Daniel Schreiber
         > > <daniel.schreiber@xxxxxxxxxxxxxxxxxx> wrote:
         > >>
         > >> Hi,
         > >>
         > >> I installed a fresh cluster using cephadm:
         > >>
         > >> - bootstrapped one node
         > >> - extended it to 3 monitor nodes, each running mon + mgr, using a
         > >>   spec file
         > >> - added 12 OSD hosts to the spec file with the following disk rules:
         > >>
         > >> ~~~
         > >> service_type: osd
         > >> service_id: osd_spec_hdd
         > >> placement:
         > >>     label: osd
         > >> spec:
         > >>     data_devices:
         > >>       model: "HGST HUH721212AL" # HDDs
         > >>     db_devices:
         > >>       model: "SAMSUNG MZ7KH1T9" # SATA SSDs
         > >>
         > >> ---
         > >>
         > >> service_type: osd
         > >> service_id: osd_spec_nvme
         > >> placement:
         > >>     label: osd
         > >> spec:
         > >>     data_devices:
         > >>       model: "SAMSUNG MZPLL1T6HAJQ-00005" # NVMEs
         > >> ~~~
         > >>
         > >> OSDs on HDD + SSD were deployed, the NVMe OSDs were not.
         > >>
         > >> MGRs crashed, one after the other:
         > >>
         > >> debug    -65> 2022-07-25T17:06:36.507+0000 7f4a33f80700  5 cephsqlite: FullPathname: (client.17139) 1: /.mgr:devicehealth/main.db
         > >> debug    -64> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO sso] Loading SSO DB version=1
         > >> debug    -63> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr get_store get_store key: mgr/dashboard/ssodb_v1
         > >> debug    -62> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 ceph_store_get ssodb_v1 not found
         > >> debug    -61> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] server: ssl=no host=:: port=8080
         > >> debug    -60> 2022-07-25T17:06:36.507+0000 7f4a34f82700  0 [dashboard INFO root] Configured CherryPy, starting engine...
         > >> debug    -59> 2022-07-25T17:06:36.507+0000 7f4a34f82700  4 mgr set_uri module dashboard set URI 'http://192.168.14.201:8080/'
         > >> debug    -58> 2022-07-25T17:06:36.511+0000 7f4a64e91700  4 ceph_store_get active_devices not found
         > >> debug    -57> 2022-07-25T17:06:36.511+0000 7f4a33f80700 -1 *** Caught signal (Aborted) **
         > >>    in thread 7f4a33f80700 thread_name:devicehealth
         > >>    ceph version 17.2.2 (b6e46b8939c67a6cc754abb4d0ece3c8918eccc3) quincy (stable)
         > >>    1: /lib64/libpthread.so.0(+0x12ce0) [0x7f4a9b0d0ce0]
         > >>    2: gsignal()
         > >>    3: abort()
         > >>    4: /lib64/libstdc++.so.6(+0x9009b) [0x7f4a9a4cf09b]
         > >>    5: /lib64/libstdc++.so.6(+0x9653c) [0x7f4a9a4d553c]
         > >>    6: /lib64/libstdc++.so.6(+0x96597) [0x7f4a9a4d5597]
         > >>    7: /lib64/libstdc++.so.6(+0x967f8) [0x7f4a9a4d57f8]
         > >>    8: (std::__throw_regex_error(std::regex_constants::error_type, char const*)+0x4a) [0x5607b31d5eea]
         > >>    9: (bool std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_expression_term<false, false>(std::__detail::_Compiler<std::__cxx11::regex>
         > >>    10: (void std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_insert_bracket_matcher<false, false>(bool)+0x146) [0x5607b31e26b6]
         > >>    11: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_bracket_expression()+0x6b) [0x5607b31e663b]
         > >>    12: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x6a) [0x5607b31e671a]
         > >>    13: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
         > >>    14: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
         > >>    15: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_atom()+0x338) [0x5607b31e69e8]
         > >>    16: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0xd0) [0x5607b31e6ca0]
         > >>    17: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
         > >>    18: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
         > >>    19: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
         > >>    20: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_alternative()+0x42) [0x5607b31e6c12]
         > >>    21: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_M_disjunction()+0x30) [0x5607b31e6df0]
         > >>    22: (std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, char const*, std::locale const&, std::regex_constants::syn>
         > >>    23: /lib64/libcephsqlite.so(+0x1b7ca) [0x7f4a9d8ba7ca]
         > >>    24: /lib64/libcephsqlite.so(+0x24486) [0x7f4a9d8c3486]
         > >>    25: /lib64/libsqlite3.so.0(+0x75f1c) [0x7f4a9d600f1c]
         > >>    26: /lib64/libsqlite3.so.0(+0xdd4c9) [0x7f4a9d6684c9]
         > >>    27: pysqlite_connection_init()
         > >>    28: /lib64/libpython3.6m.so.1.0(+0x13afc6) [0x7f4a9d182fc6]
         > >>    29: PyObject_Call()
         > >>    30: /lib64/python3.6/lib-dynload/_sqlite3.cpython-36m-x86_64-linux-gnu.so(+0xa1f5) [0x7f4a8bdf31f5]
         > >>    31: /lib64/libpython3.6m.so.1.0(+0x19d5f1) [0x7f4a9d1e55f1]
         > >>    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
         > >>
         > >> Is there anything I can do to recover from this? Is there anything
         > >> I can add to help debugging this?
         > >>
         > >> Thank you,
         > >>
         > >> Daniel
         > >> --
         > >> Daniel Schreiber
         > >> Facharbeitsgruppe Systemsoftware
         > >> Universitaetsrechenzentrum
         > >>
         > >> Technische Universität Chemnitz
         > >> Straße der Nationen 62 (Raum B303)
         > >> 09111 Chemnitz
         > >> Germany
         > >>
         > >> Tel:     +49 371 531 35444
         > >> Fax:     +49 371 531 835444
         > >
         >
         > --
         > Daniel Schreiber
         > Facharbeitsgruppe Systemsoftware
         > Universitaetsrechenzentrum
         >
         > Technische Universität Chemnitz
         > Straße der Nationen 62 (Raum B303)
         > 09111 Chemnitz
         > Germany
         >
         > Tel:     +49 371 531 35444
         > Fax:     +49 371 531 835444


--
Daniel Schreiber
Facharbeitsgruppe Systemsoftware
Universitaetsrechenzentrum

Technische Universität Chemnitz
Straße der Nationen 62 (Raum B303)
09111 Chemnitz
Germany

Tel:     +49 371 531 35444
Fax:     +49 371 531 835444
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
