Re: Issue Upgrading to 16.2.7 related to mon_mds_skip_sanity.

Hi Ilya,

On Thu, Dec 23, 2021 at 1:08 PM Ilya Kogan <ikogan@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> I had originally asked the question on the pull request here:
> https://github.com/ceph/ceph/pull/44131 and was asked to continue the
> discussion on this list.
>
> Last night I upgraded my cluster from 16.2.5 (I think) to 16.2.7.
> Unfortunately, thinking this was a minor patch, I failed to read the
> upgrade instructions, so I did not set "mon_mds_skip_sanity = true" before
> upgrading (I'm not using cephadm). As a result, the monitor on my first
> node crashed after the upgrade. I then discovered the recommendation, set
> the flag, and the monitor started. Once the monitor was up, I removed the
> flag from the config, restarted it, and it crashed again. Thinking maybe
> the upgrade wasn't fully complete _or_ I had to upgrade all of my monitors
> first, I went ahead and upgraded the rest of the cluster with the flag set
> and let it sit overnight.
> [...]
> e3719
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses
> inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 4
>
> Filesystem 'clusterfs-capacity' (4)
> fs_name clusterfs-capacity
> epoch 3719
> flags 12
> created 2021-04-25T20:22:33.434467-0400
> modified 2021-12-23T04:00:59.962394-0500
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> required_client_features {}
> last_failure 0
> last_failure_osd_epoch 265418
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in 0
> up {0=448184627}
> failed
> damaged
> stopped
> data_pools [18]
> metadata_pool 19
> inline_data disabled
> balancer
> standby_count_wanted 1
> [mds.ceph2{0:448184627} state up:active seq 5103 addr [v2:
> 10.10.0.1:6800/4098373857,v1:10.10.0.1:6801/4098373857] compat
> {c=[1],r=[1],i=[7ff]}]
>
>
> Filesystem 'clusterfs-performance' (5)
> fs_name clusterfs-performance
> epoch 3710
> flags 12
> created 2021-09-07T19:50:05.174359-0400
> modified 2021-12-22T21:40:43.813907-0500
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> required_client_features {}
> last_failure 0
> last_failure_osd_epoch 261954
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in 0
> up {0=448071169}
> failed
> damaged
> stopped
> data_pools [28]
> metadata_pool 27
> inline_data disabled
> balancer
> standby_count_wanted 1
> [mds.ceph4{0:448071169} state up:active seq 2852 addr [v2:
> 10.10.0.3:6800/368534096,v1:10.10.0.3:6801/368534096] compat
> {c=[1],r=[1],i=[77f]}]
>
>
> Standby daemons:
>
> [mds.ceph3{-1:448024430} state up:standby seq 1 addr [v2:
> 10.10.0.2:6800/535990846,v1:10.10.0.2:6801/535990846] compat
> {c=[1],r=[1],i=[77f]}]
> [mds.ceph5{-1:448169888} state up:standby seq 1 addr [v2:
> 10.10.0.20:6800/1349962329,v1:10.10.0.20:6801/1349962329] compat
> {c=[1],r=[1],i=[77f]}]
> [mds.ceph1{-1:448321175} state up:standby seq 1 addr [v2:
> 10.10.0.32:6800/881362738,v1:10.10.0.32:6801/881362738] compat
> {c=[1],r=[1],i=[7ff]}]

The "i=[77f]" indicates to me this may be an MDS older than 16.2.7.
This should not otherwise be possible.
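
(As a quick check on that reading, the two masks shown in the dump
above differ by exactly one feature bit:

    $ python3 -c 'print(hex(0x7ff ^ 0x77f))'
    0x80

so the "77f" daemons are each advertising one fewer incompat feature
than the "7ff" ones.)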

In any case, I'm not exactly sure how this happened to your cluster,
but the fix should be:

> ceph mds fail clusterfs-performance:0

That should cause the mons to select the one standby MDS with
"i=[7ff]" (mds.ceph1, per the dump above) as the replacement. Beyond
that, confirm that all of your MDS daemons have in fact been upgraded.
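
One way to check (a sketch, nothing cluster-specific: "ceph versions"
reports the running version of every daemon, and re-running "ceph fs
dump" lets you confirm the compat bits after the failover):

    ceph versions
    ceph fs dump | grep -i compat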

Once that's done, you should be able to remove the config.
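
For example, assuming you set the flag through the centralized config
rather than in ceph.conf (adjust accordingly if not):

    ceph config rm mon mon_mds_skip_sanity

If it lives in ceph.conf instead, delete the line there and restart
the mons.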

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


