Thanks for your detailed explanations! That helped a lot.
All MDS daemons are still in error status. "ceph orch device ls" showed that
some hosts seem not to have enough free space on their devices. I wonder
why I didn't see that in monitoring. Anyway, I'll fix that and then try to
proceed.
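For anyone following along, this is roughly how I'm checking it (a sketch;
the --wide flag is from memory, so double-check it on your version):

    # how the orchestrator sees the devices, including why one is rejected
    ceph orch device ls --wide

    # free space on a suspect host (run directly on that host)
    df -h /var/lib/ceph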
When the backport is available I'll try to upgrade as fast as possible.
Hopefully that will suffice to restore the system.
Thank you all so far for your great work!
On 4/10/23 23:02, Adam King wrote:
It seems like it maybe didn't actually do the redeploy, as it should log
something saying it's actually doing it in addition to the line saying it
scheduled it. To confirm, the upgrade is paused ("ceph orch upgrade
status" reports is_paused as true)? If so, maybe try doing a mgr
failover ("ceph mgr fail") and then check "ceph orch ps" and "ceph orch
device ls" a few minutes later and look at the REFRESHED column. If any
of those show times farther back than when you did the failover, there's
probably something going on on the host(s) that haven't refreshed
recently that's holding things up (you'd have to go on that host and
look for hanging cephadm commands).
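Roughly, those checks would look like this (just the standard orchestrator
commands):

    # confirm the upgrade really is paused
    ceph orch upgrade status

    # force a mgr failover so the standby takes over
    ceph mgr fail

    # a few minutes later, check the REFRESHED column for stale entries
    ceph orch ps
    ceph orch device ls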
Lastly, you could look at the /var/lib/ceph/<fsid>/<mds-daemon-name>/unit.run
file on the hosts where the mds daemons are deployed. The (very long) last
podman/docker run line in that file should have the name of the image the
daemon is being deployed with, so you could use that to confirm whether
cephadm ever actually tried a redeploy of the mds with the new image. You
could also check the journal logs for the mds.
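For example, something like this on one of the mds hosts (the grep is just
one way to pull out the last container run line):

    # show the last podman/docker run line, which contains the image name
    grep -E 'podman|docker' /var/lib/ceph/<fsid>/<mds-daemon-name>/unit.run | tail -n 1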
Cephadm reports the systemd unit name for the daemon as part of "cephadm
ls" output. If you put a copy of the cephadm binary on the host, run
"cephadm ls" with it, and grab the systemd unit name for the mds daemon
from that output, you can use it to check the journal logs, which should
tell you the last restart time and why it went down.
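As a rough sketch (I believe the field in the "cephadm ls" JSON is called
systemd_unit, but double-check on your version):

    # on the affected host, with a copy of the cephadm binary
    sudo ./cephadm ls > daemons.json
    grep -A 20 '"name": "mds.' daemons.json | grep systemd_unit

    # then check the journal for that unit to see the last restart and error
    journalctl -u <systemd-unit-name>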
On Mon, Apr 10, 2023 at 4:25 PM Thomas Widhalm <widhalmt@xxxxxxxxxxxxx> wrote:
I did what you told me.
I also see in the log that the command went through:
2023-04-10T19:58:46.522477+0000 mgr.ceph04.qaexpv [INF] Schedule redeploy daemon mds.mds01.ceph06.rrxmks
2023-04-10T20:01:03.360559+0000 mgr.ceph04.qaexpv [INF] Schedule redeploy daemon mds.mds01.ceph05.pqxmvt
2023-04-10T20:01:21.787635+0000 mgr.ceph04.qaexpv [INF] Schedule redeploy daemon mds.mds01.ceph07.omdisd
But the MDS never start. They stay in error state. I tried to redeploy
and start them a few times, and even restarted one host where an MDS
should run.
mds.mds01.ceph03.xqwdjy  ceph03  error  32m ago  2M   -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph04.hcmvae  ceph04  error  31m ago  2h   -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph05.pqxmvt  ceph05  error  32m ago  9M   -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph06.rrxmks  ceph06  error  32m ago  10w  -  -  <unknown>  <unknown>  <unknown>
mds.mds01.ceph07.omdisd  ceph07  error  32m ago  2M   -  -  <unknown>  <unknown>  <unknown>
Any other ideas? Or am I missing something?
Cheers,
Thomas
On 10.04.23 21:53, Adam King wrote:
> Will also note that the normal upgrade process scales down the mds
> service to have only 1 mds per fs before upgrading it, so maybe
> something you'd want to do as well if the upgrade didn't do it already.
> It does so by setting the max_mds to 1 for the fs.
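> For reference, doing that by hand is just (the fs name here is a placeholder):
>
>     # scale the filesystem down to a single active MDS
>     ceph fs set <fs-name> max_mds 1
>
>     # confirm only one rank is (or should be) active
>     ceph fs status <fs-name>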
>
> On Mon, Apr 10, 2023 at 3:51 PM Adam King <adking@xxxxxxxxxx> wrote:
>
> You could try pausing the upgrade and manually "upgrading" the mds
> daemons by redeploying them on the new image. Something like "ceph
> orch daemon redeploy <mds-daemon-name> --image <17.2.6 image>"
> (daemon names should match those in "ceph orch ps" output). If you
> do that for all of them and then get them into an up state you
> should be able to resume the upgrade and have it complete.
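> As a concrete sketch (daemon name taken from your "ceph orch ps" output,
> image tag assumed to be the 17.2.6 release image):
>
>     ceph orch upgrade pause
>
>     # repeat for each mds daemon
>     ceph orch daemon redeploy mds.mds01.ceph06.rrxmks --image quay.io/ceph/ceph:v17.2.6
>
>     # once they're all back up
>     ceph orch upgrade resume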
>
> On Mon, Apr 10, 2023 at 3:25 PM Thomas Widhalm <widhalmt@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> If you remember, I hit bug https://tracker.ceph.com/issues/58489 so I
> was very relieved when 17.2.6 was released and started to update
> immediately.
>
> But now I'm stuck again with my broken MDS. MDS won't get into
> up:active without the update but the update waits for them to get into
> up:active state. Seems like a deadlock / chicken-egg problem to me.
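> For context, I'm watching the states with the usual status commands, roughly:
>
>     ceph fs status              # per-rank MDS states (up:active, up:replay, ...)
>     ceph mds stat               # compact MDS summary
>     ceph orch upgrade status    # the JSON shown below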
>
> Since I'm still relatively new to Ceph, could you help me?
>
> What I see when watching the update status:
>
> {
>     "target_image": "quay.io/ceph/ceph@sha256:1161e35e4e02cf377c93b913ce78773f8413f5a8d7c5eaee4b4773a4f9dd6635",
>     "in_progress": true,
>     "which": "Upgrading all daemon types on all hosts",
>     "services_complete": [
>         "crash",
>         "mgr",
>         "mon",
>         "osd"
>     ],
>     "progress": "18/40 daemons upgraded",
>     "message": "Error: UPGRADE_OFFLINE_HOST: Upgrade: Failed to connect to host ceph01 at addr (192.168.23.61)",
>     "is_paused": false
> }
>
> (The offline host was one host that broke during the upgrade. I fixed
> that in the meantime and the update went on.)
>
> And in the log:
>
> 2023-04-10T19:23:48.750129+0000 mgr.ceph04.qaexpv [INF] Upgrade: Waiting
> for mds.mds01.ceph04.hcmvae to be up:active (currently up:replay)
> 2023-04-10T19:23:58.758141+0000 mgr.ceph04.qaexpv [WRN] Upgrade: No mds
> is up; continuing upgrade procedure to poke things in the right direction
>
>
> Please give me a hint as to what I can do.
>
> Cheers,
> Thomas
> --
> http://www.widhalm.or.at
> GnuPG : 6265BAE6 , A84CB603
> Threema: H7AV7D33
> Telegram, Signal: widhalmt@xxxxxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx