Hi,

I didn't notice anything suspicious in the mgr logs, nor in cephadm.log (attaching an extract of the latest). What I have noticed is that the active mgr container gets restarted about every 3 minutes, as reported by "ceph -w":

"""
2022-05-18T15:30:49.883238+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:30:49.889294+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:30:50.832200+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
2022-05-18T15:34:16.979735+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:34:16.985531+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:34:18.246784+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
2022-05-18T15:37:34.576159+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:37:34.582935+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:37:35.821200+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
2022-05-18T15:40:00.000148+0200 mon.naret-monitor01 [INF] overall HEALTH_OK
2022-05-18T15:40:52.456182+0200 mon.naret-monitor01 [INF] Active manager daemon naret-monitor01.tvddjv restarted
2022-05-18T15:40:52.461826+0200 mon.naret-monitor01 [INF] Activating manager daemon naret-monitor01.tvddjv
2022-05-18T15:40:53.787353+0200 mon.naret-monitor01 [INF] Manager daemon naret-monitor01.tvddjv is now available
"""

I'm also attaching the logs of the active mgr process.

The cluster is working fine, but I wonder whether this mgr/cephadm restart behaviour is itself wrong and might be causing the upgrade to stall.

Thanks,
Giuseppe

On 18.05.22, 14:19, "Eugen Block" <eblock@xxxxxx> wrote:

Do you see anything suspicious in /var/log/ceph/cephadm.log?
Also check the mgr logs for any hints.

Zitat von Lo Re Giuseppe <giuseppe.lore@xxxxxxx>:

> Hi,
>
> We have happily tested the upgrade from v15.2.16 to v16.2.7 with
> cephadm on a test cluster made of 3 nodes, and everything went
> smoothly.
> Today we started the very same operation on the production cluster
> (20 OSD servers, 720 HDDs), and the upgrade process doesn't do
> anything at all.
>
> To be more specific, we issued the command
>
> ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.7
>
> and soon after, "ceph -s" reported
>
> Upgrade to quay.io/ceph/ceph:v16.2.7 (0s)
>   [............................]
>
> but only for a few seconds; after that:
>
> [root@naret-monitor01 ~]# ceph -s
>   cluster:
>     id:     63334166-d991-11eb-99de-40a6b72108d0
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum naret-monitor01,naret-monitor02,naret-monitor03 (age 7d)
>     mgr: naret-monitor01.tvddjv(active, since 60s), standbys: naret-monitor02.btynnb
>     mds: cephfs:1 {0=cephfs.naret-monitor01.uvevbf=up:active} 2 up:standby
>     osd: 760 osds: 760 up (since 6d), 760 in (since 2w)
>     rgw: 3 daemons active (cscs-realm.naret-zone.naret-rgw01.qvhhbi, cscs-realm.naret-zone.naret-rgw02.pduagk, cscs-realm.naret-zone.naret-rgw03.aqdkkb)
>
>   task status:
>
>   data:
>     pools:   30 pools, 16497 pgs
>     objects: 833.14M objects, 3.1 PiB
>     usage:   5.0 PiB used, 5.9 PiB / 11 PiB avail
>     pgs:     16460 active+clean
>              37    active+clean+scrubbing+deep
>
>   io:
>     client:   4.7 MiB/s rd, 4.0 MiB/s wr, 122 op/s rd, 47 op/s wr
>
>   progress:
>     Removing image fulen-hdd/c991f6fdf41964 from trash (53s)
>       [............................] (remaining: 81m)
>
> The command "ceph orch upgrade status" says:
>
> {
>     "target_image": "quay.io/ceph/ceph:v16.2.7",
>     "in_progress": true,
>     "services_complete": [],
>     "message": ""
> }
>
> It doesn't even pull the container image.
> I have verified that "podman pull" itself works: I was able to pull
> quay.io/ceph/ceph:v16.2.7.
>
> "ceph -w" and "ceph -W cephadm" don't report any activity related to
> the upgrade.
>
> Has anyone seen anything similar?
> Do you have any advice on how to understand what's preventing the
> upgrade process from actually starting?
>
> Thanks in advance,
>
> Giuseppe
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
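[Editor's note] For readers hitting the same symptoms (a stalled "ceph orch upgrade" plus a restarting active mgr), a minimal diagnostic sketch with standard Ceph/cephadm CLI commands might look like the following. The daemon name mgr.naret-monitor01.tvddjv is taken from this thread; substitute your own from "ceph mgr stat". None of this is confirmed as the fix in this thread; it only surfaces where the cephadm state machine is stuck.

```shell
# Check whether the periodic mgr restarts left crash reports behind.
ceph crash ls

# Raise the cephadm module's cluster log level so that
# "ceph -W cephadm --watch-debug" shows what the upgrade
# state machine is (or is not) doing.
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug

# Inspect the logs of the active mgr daemon on its host
# (run on that host; substitute your own daemon name).
cephadm logs --name mgr.naret-monitor01.tvddjv

# If the upgrade is wedged inside the active mgr, failing over
# to the standby mgr often kicks it back into action.
ceph mgr fail

# Afterwards, drop the cephadm log level back to the default.
ceph config set mgr mgr/cephadm/log_to_cluster_level info
```

These commands are read-mostly except for "ceph mgr fail", which triggers a mgr failover and is generally safe on a healthy cluster with a standby mgr, but should be run deliberately.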