Interesting. So every node was rebooted after packages were upgraded. I noticed that Proxmox lists 3 of the MDS versions as 16.2.7 and 2 of them have nothing in the version string. `ceph mds versions` shows 3 MDSes on 16.2.7 and... nothing else. After running `ceph mds fail clusterfs-performance:0`, an additional MDS began showing version 16.2.7, but I still had one showing an empty version. I simply restarted that remaining MDS and the version then showed up. Finally, I was able to remove mon_mds_skip_sanity from my config and restart each monitor. Looks like they came up ok, thanks!

Ilya Kogan
w: github.com/ikogan
e: ikogan@xxxxxxxxxxxxx
<http://twitter.com/ilkogan> <https://www.linkedin.com/in/ilyakogan/>

On Thu, Dec 23, 2021 at 2:16 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:

> Hi Ilya,
>
> On Thu, Dec 23, 2021 at 1:08 PM Ilya Kogan <ikogan@xxxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > I had originally asked the question on the pull request here:
> > https://github.com/ceph/ceph/pull/44131 and was asked to continue the
> > discussion on this list.
> >
> > Last night I upgraded my cluster from 16.2.5 (I think) to 16.2.7.
> > Unfortunately, thinking this was a minor patch, I failed to read the
> > upgrade instructions so I did not set "mon_mds_skip_sanity = true" before
> > upgrading, and I'm not using cephadm. As a result, the monitor on my first
> > node crashed after upgrade. I then discovered the recommendation, set the
> > flag, and the monitor started. Once the monitor was up, I removed the flag
> > from the config, restarted it, and it crashed again. Thinking maybe the
> > upgrade wasn't fully complete _or_ I had to upgrade all of my monitors
> > first, I went ahead and upgraded the rest of the cluster with the flag set
> > and let it sit overnight.
> > [...]
> > e3719
> > enable_multiple, ever_enabled_multiple: 1,1
> > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > legacy client fscid: 4
> >
> > Filesystem 'clusterfs-capacity' (4)
> > fs_name clusterfs-capacity
> > epoch 3719
> > flags 12
> > created 2021-04-25T20:22:33.434467-0400
> > modified 2021-12-23T04:00:59.962394-0500
> > tableserver 0
> > root 0
> > session_timeout 60
> > session_autoclose 300
> > max_file_size 1099511627776
> > required_client_features {}
> > last_failure 0
> > last_failure_osd_epoch 265418
> > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in 0
> > up {0=448184627}
> > failed
> > damaged
> > stopped
> > data_pools [18]
> > metadata_pool 19
> > inline_data disabled
> > balancer
> > standby_count_wanted 1
> > [mds.ceph2{0:448184627} state up:active seq 5103 addr [v2:10.10.0.1:6800/4098373857,v1:10.10.0.1:6801/4098373857] compat {c=[1],r=[1],i=[7ff]}]
> >
> >
> > Filesystem 'clusterfs-performance' (5)
> > fs_name clusterfs-performance
> > epoch 3710
> > flags 12
> > created 2021-09-07T19:50:05.174359-0400
> > modified 2021-12-22T21:40:43.813907-0500
> > tableserver 0
> > root 0
> > session_timeout 60
> > session_autoclose 300
> > max_file_size 1099511627776
> > required_client_features {}
> > last_failure 0
> > last_failure_osd_epoch 261954
> > compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in 0
> > up {0=448071169}
> > failed
> > damaged
> > stopped
> > data_pools [28]
> > metadata_pool 27
> > inline_data disabled
> > balancer
> > standby_count_wanted 1
> > [mds.ceph4{0:448071169} state up:active seq 2852 addr [v2:10.10.0.3:6800/368534096,v1:10.10.0.3:6801/368534096] compat {c=[1],r=[1],i=[77f]}]
> >
> >
> > Standby daemons:
> >
> > [mds.ceph3{-1:448024430} state up:standby seq 1 addr [v2:10.10.0.2:6800/535990846,v1:10.10.0.2:6801/535990846] compat {c=[1],r=[1],i=[77f]}]
> > [mds.ceph5{-1:448169888} state up:standby seq 1 addr [v2:10.10.0.20:6800/1349962329,v1:10.10.0.20:6801/1349962329] compat {c=[1],r=[1],i=[77f]}]
> > [mds.ceph1{-1:448321175} state up:standby seq 1 addr [v2:10.10.0.32:6800/881362738,v1:10.10.0.32:6801/881362738] compat {c=[1],r=[1],i=[7ff]}]
>
> The "i=[77f]" indicates to me this may be an MDS older than 16.2.7.
> This should not otherwise be possible.
>
> In any case, I'm not exactly sure how this happened to your cluster
> but the fix should be:
>
> > ceph mds fail clusterfs-performance:0
>
> That should cause the mons to select the one standby mds with
> "i=[7ff]" for replacement. Confirm your MDS are all upgraded
> otherwise.
>
> Once that's done you should be able to remove the config.
>
> --
> Patrick Donnelly, Ph.D.
> He / Him / His
> Principal Software Engineer
> Red Hat, Inc.
> GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
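
A minimal sketch of the checks discussed in the thread above, assuming the `ceph` CLI is run from a node with an admin keyring (the grep filter is only illustrative). Per Patrick's note, an MDS still advertising i=[77f] in the dump is likely older than 16.2.7.

    # Show which release each daemon reports; MDS entries with an empty
    # version string are the ones to investigate (as seen in this thread).
    ceph versions
    ceph mds versions

    # Show the per-MDS compat sets Patrick refers to (i=[77f] vs i=[7ff]).
    ceph fs dump | grep compat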
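And a sketch of the recovery sequence that worked here, assuming a non-cephadm cluster with systemd-managed daemons; the unit names (ceph-mds@ceph3, ceph-mon@ceph1) are placeholders, and `ceph config rm` only applies if the option was set in the config database rather than in ceph.conf.

    # Fail rank 0 of the affected filesystem so a standby with the newer
    # compat set (i=[7ff]) can take over, per Patrick's suggestion.
    ceph mds fail clusterfs-performance:0

    # Restart any MDS that still reports an empty or old version
    # (unit name is a placeholder; substitute the real MDS name).
    systemctl restart ceph-mds@ceph3

    # Once every MDS reports 16.2.7, drop the workaround and restart the
    # monitors one at a time, waiting for quorum in between.
    ceph config rm mon mon_mds_skip_sanity   # or remove it from ceph.conf
    systemctl restart ceph-mon@ceph1         # repeat on each monitor host
    ceph -s                                  # confirm quorum and health

Restarting the monitors one at a time keeps quorum available while the workaround is being removed.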