Re: mds stuck in standby, not one active


 



On Thu, Dec 15, 2022 at 3:17 PM Mevludin Blazevic
<mblazevic@xxxxxxxxxxxxxx> wrote:
>
> Ceph fs dump:
>
> e62
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}legacy client fscid: 1
>
> Filesystem 'ceph_fs' (1)
> fs_name ceph_fs
> epoch   62
> flags   12
> created 2022-11-28T12:05:17.203346+0000
> modified        2022-12-15T12:09:14.091724+0000
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> required_client_features        {}
> last_failure    0
> last_failure_osd_epoch  196035
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in      0
> up      {}
> failed  0
> damaged
> stopped
> data_pools      [4]
> metadata_pool   5
> inline_data     disabled
> balancer
> standby_count_wanted    1
>
> Standby daemons:
>
> [mds.ceph_fs.store5.gnlqqm{-1:152180029} state up:standby seq 1
> join_fscid=1 addr
> [v2:192.168.50.135:6800/3548272808,v1:192.168.50.135:6801/3548272808]
> compat {c=[1],r=[1],i=[1]}]
> [mds.ceph_fs.store6.fxgvoj{-1:152416137} state up:standby seq 1
> join_fscid=1 addr
> [v2:192.168.50.136:7024/1339959968,v1:192.168.50.136:7025/1339959968]
> compat {c=[1],r=[1],i=[1]}]
> [mds.ceph_fs.store4.mhvpot{-1:152477853} state up:standby seq 1
> join_fscid=1 addr
> [v2:192.168.50.134:6800/3098669884,v1:192.168.50.134:6801/3098669884]
> compat {c=[1],r=[1],i=[1]}]
> [mds.ceph_fs.store3.vcnwzh{-1:152481783} state up:standby seq 1
> join_fscid=1 addr
> [v2:192.168.50.133:6800/77378788,v1:192.168.50.133:6801/77378788] compat
> {c=[1],r=[1],i=[1]}]
> dumped fsmap epoch 62
>
> Ceph Status:
>
>    cluster:
>      id:     8c774934-1535-11ec-973e-525400130e4f
>      health: HEALTH_ERR
>              1 filesystem is degraded
>              1 filesystem has a failed mds daemon
>              1 filesystem is offline
>              26 daemons have recently crashed
>
>    services:
>      mon: 2 daemons, quorum cephadm-vm,store2 (age 2d)
>      mgr: store1.uevcpd(active, since 2d), standbys: cephadm-vm.zwagng
>      mds: 0/1 daemons up (1 failed), 4 standby
>      osd: 312 osds: 312 up (since 8h), 312 in (since 17h)
>
>    data:
>      volumes: 0/1 healthy, 1 failed
>      pools:   7 pools, 289 pgs
>      objects: 2.62M objects, 9.8 TiB
>      usage:   29 TiB used, 1.9 PiB / 1.9 PiB avail
>      pgs:     286 active+clean
>               3   active+clean+scrubbing+deep
>
>    io:
>      client:   945 KiB/s rd, 3.3 MiB/s wr, 516 op/s rd, 562 op/s wr
>
> Ceph Health detail:
>
> HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds
> daemon; 1 filesystem is offline; 26 daemons have recently crashed
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>      fs ceph_fs is degraded
> [WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
>      fs ceph_fs has 1 failed mds
> [ERR] MDS_ALL_DOWN: 1 filesystem is offline
>      fs ceph_fs is offline because no MDS is active for it.
> [WRN] RECENT_CRASH: 26 daemons have recently crashed
>      osd.323 crashed on host store7 at 2022-12-12T14:03:23.857874Z
>      osd.323 crashed on host store7 at 2022-12-12T14:03:43.945625Z
>      osd.323 crashed on host store7 at 2022-12-12T14:04:03.282797Z
>      osd.323 crashed on host store7 at 2022-12-12T14:04:22.612037Z
>      osd.323 crashed on host store7 at 2022-12-12T14:04:41.630473Z
>      osd.323 crashed on host store7 at 2022-12-12T14:34:49.237008Z
>      osd.323 crashed on host store7 at 2022-12-12T14:35:09.903922Z
>      osd.323 crashed on host store7 at 2022-12-12T14:35:28.621955Z
>      osd.323 crashed on host store7 at 2022-12-12T14:35:46.985517Z
>      osd.323 crashed on host store7 at 2022-12-12T14:36:05.375758Z
>      osd.323 crashed on host store7 at 2022-12-12T15:01:57.235785Z
>      osd.323 crashed on host store7 at 2022-12-12T15:02:16.581335Z
>      osd.323 crashed on host store7 at 2022-12-12T15:02:33.212653Z
>      osd.323 crashed on host store7 at 2022-12-12T15:02:49.775560Z
>      osd.323 crashed on host store7 at 2022-12-12T15:03:06.303861Z
>      mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
> 2022-12-13T13:21:41.149773Z
>      mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
> 2022-12-13T13:22:15.413105Z
>      mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
> 2022-12-13T13:23:39.888401Z
>      mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
> 2022-12-13T13:27:56.458529Z
>      mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
> 2022-12-13T13:31:03.791532Z
>      mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
> 2022-12-13T13:34:24.023106Z
>      osd.98 crashed on host store3 at 2022-12-13T16:11:38.064735Z
>      mgr.store1.uevcpd crashed on host store1 at 2022-12-13T18:39:33.091261Z
>      osd.322 crashed on host store6 at 2022-12-14T06:06:14.193437Z
>      osd.234 crashed on host store8 at 2022-12-15T02:32:13.009795Z
>      osd.311 crashed on host store8 at 2022-12-15T02:32:18.407978Z
>
> As suggested, I was going to upgrade the Ceph cluster to 16.2.7 to fix
> the MDS issue, but it seems none of the running standby daemons is
> responding.

I'd suggest also looking at the cephadm logs, which may explain why it's stuck:

https://docs.ceph.com/en/quincy/cephadm/operations/#watching-cephadm-log-messages
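For convenience, the commands that page describes are roughly (run from a
node with an admin keyring):

  # follow cephadm log messages as they are emitted
  ceph -W cephadm

  # to also see debug-level messages, raise the log level first
  ceph config set mgr mgr/cephadm/log_to_cluster_level debug
  ceph -W cephadm --watch-debug

  # or dump the most recent cephadm log entries
  ceph log last cephadm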

Other than the fact that your MDS daemons have not been upgraded, I don't
see a problem from the CephFS side. You can try removing the daemons; it
probably can't make things worse :)
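
If you go that route, a rough sketch of what I mean, assuming the MDS
daemons are cephadm-managed (the daemon name below is just one of the
standbys from your fs dump; double-check names with "ceph orch ps" first):

  # see which versions the daemons are actually running
  ceph versions
  ceph orch ps --daemon_type mds

  # remove one stuck standby; cephadm should redeploy it from the mds spec
  ceph orch daemon rm mds.ceph_fs.store5.gnlqqm --force

  # if nothing gets redeployed, reapply the mds service spec
  # (the placement count here is just an example)
  ceph orch apply mds ceph_fs --placement="4"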

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat, Inc.
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


