Re: Issue Upgrading to 16.2.7 related to mon_mds_skip_sanity.

Interesting. So every node was rebooted after the packages were upgraded. I
noticed that Proxmox lists the version of 3 of the MDSes as 16.2.7, while 2
of them have nothing in the version string. `ceph mds versions` shows 3
MDSes on 16.2.7 and...nothing else. After running `ceph mds fail
clusterfs-performance:0`, an additional MDS began showing version 16.2.7,
but I've still got one showing an empty version.
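
In case it helps anyone hitting the same thing, something like the
following should show which daemon is the odd one out (ceph3 is just an
example name from my cluster; check each of yours):

    # per-daemon metadata includes a ceph_version field
    ceph mds metadata ceph3
    # cluster-wide summary of which daemons report which version
    ceph versions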

I simply restarted the MDS that was still showing an empty version, and the
version then showed up. Finally, I was able to remove mon_mds_skip_sanity
from my config and restart each monitor. Looks like they came up ok, thanks!
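
For completeness, here's roughly the sequence, sketched from memory (I'm
on plain packages with systemd rather than cephadm, and the daemon/host
names below are just examples from my cluster):

    # restart the MDS that was still reporting an empty version
    systemctl restart ceph-mds@ceph3
    # drop the workaround if it was set via the config db; if it was
    # added to ceph.conf instead, remove that line by hand
    ceph config rm mon mon_mds_skip_sanity
    # restart each monitor in turn, checking quorum in between
    systemctl restart ceph-mon@ceph1
    ceph -s
    # all MDS daemons should now report 16.2.7
    ceph mds versions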

Ilya Kogan
w: github.com/ikogan   e:  ikogan@xxxxxxxxxxxxx
  <http://twitter.com/ilkogan>    <https://www.linkedin.com/in/ilyakogan/>


On Thu, Dec 23, 2021 at 2:16 PM Patrick Donnelly <pdonnell@xxxxxxxxxx>
wrote:

> Hi Ilya,
>
> On Thu, Dec 23, 2021 at 1:08 PM Ilya Kogan <ikogan@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I had originally asked the question on the pull request here:
> > https://github.com/ceph/ceph/pull/44131 and was asked to continue the
> > discussion on this list.
> >
> > Last night I upgraded my cluster from 16.2.5 (I think) to 16.2.7.
> > Unfortunately, thinking this was a minor patch, I failed to read the
> > upgrade instructions so I did not set "mon_mds_skip_sanity = true" before
> > upgrading and I'm not using cephadm. As a result, the monitor on my first
> > node crashed after upgrade. I then discovered the recommendation, set the
> > flag, and the monitor started. Once the monitor was up, I removed the
> > flag
> > from the config, restarted it, and it crashed again. Thinking maybe the
> > upgrade wasn't fully complete _or_ I had to upgrade all of my monitors
> > first, I went ahead and upgraded the rest of the cluster with the flag
> > set
> > and let it sit overnight.
> > [...]
> > e3719
> > enable_multiple, ever_enabled_multiple: 1,1
> > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> > writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> > object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
> > uses
> > inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > legacy client fscid: 4
> >
> > Filesystem 'clusterfs-capacity' (4)
> > fs_name clusterfs-capacity
> > epoch 3719
> > flags 12
> > created 2021-04-25T20:22:33.434467-0400
> > modified 2021-12-23T04:00:59.962394-0500
> > tableserver 0
> > root 0
> > session_timeout 60
> > session_autoclose 300
> > max_file_size 1099511627776
> > required_client_features {}
> > last_failure 0
> > last_failure_osd_epoch 265418
> > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate
> > object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> > data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in 0
> > up {0=448184627}
> > failed
> > damaged
> > stopped
> > data_pools [18]
> > metadata_pool 19
> > inline_data disabled
> > balancer
> > standby_count_wanted 1
> > [mds.ceph2{0:448184627} state up:active seq 5103 addr [v2:
> > 10.10.0.1:6800/4098373857,v1:10.10.0.1:6801/4098373857] compat
> > {c=[1],r=[1],i=[7ff]}]
> >
> >
> > Filesystem 'clusterfs-performance' (5)
> > fs_name clusterfs-performance
> > epoch 3710
> > flags 12
> > created 2021-09-07T19:50:05.174359-0400
> > modified 2021-12-22T21:40:43.813907-0500
> > tableserver 0
> > root 0
> > session_timeout 60
> > session_autoclose 300
> > max_file_size 1099511627776
> > required_client_features {}
> > last_failure 0
> > last_failure_osd_epoch 261954
> > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate
> > object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> > data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in 0
> > up {0=448071169}
> > failed
> > damaged
> > stopped
> > data_pools [28]
> > metadata_pool 27
> > inline_data disabled
> > balancer
> > standby_count_wanted 1
> > [mds.ceph4{0:448071169} state up:active seq 2852 addr [v2:
> > 10.10.0.3:6800/368534096,v1:10.10.0.3:6801/368534096] compat
> > {c=[1],r=[1],i=[77f]}]
> >
> >
> > Standby daemons:
> >
> > [mds.ceph3{-1:448024430} state up:standby seq 1 addr [v2:
> > 10.10.0.2:6800/535990846,v1:10.10.0.2:6801/535990846] compat
> > {c=[1],r=[1],i=[77f]}]
> > [mds.ceph5{-1:448169888} state up:standby seq 1 addr [v2:
> > 10.10.0.20:6800/1349962329,v1:10.10.0.20:6801/1349962329] compat
> > {c=[1],r=[1],i=[77f]}]
> > [mds.ceph1{-1:448321175} state up:standby seq 1 addr [v2:
> > 10.10.0.32:6800/881362738,v1:10.10.0.32:6801/881362738] compat
> > {c=[1],r=[1],i=[7ff]}]
>
> The "i=[77f]" indicates to me this may be an MDS older than 16.2.7.
> This should not otherwise be possible.
>
> In any case, I'm not exactly sure how this happened to your cluster
> but the fix should be:
>
> > ceph mds fail clusterfs-performance:0
>
> That should cause the mons to select the one standby mds with
> "i=[7ff]" for replacement. Confirm your MDS are all upgraded
> otherwise.
>
> Once that's done you should be able to remove the config.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


