Ceph fs dump:
e62
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
writeable ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'ceph_fs' (1)
fs_name ceph_fs
epoch 62
flags 12
created 2022-11-28T12:05:17.203346+0000
modified 2022-12-15T12:09:14.091724+0000
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 1099511627776
required_client_features {}
last_failure 0
last_failure_osd_epoch 196035
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
ranges,3=default file layouts on dirs,4=dir inode in separate
object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {}
failed 0
damaged
stopped
data_pools [4]
metadata_pool 5
inline_data disabled
balancer
standby_count_wanted 1
Standby daemons:
[mds.ceph_fs.store5.gnlqqm{-1:152180029} state up:standby seq 1
join_fscid=1 addr
[v2:192.168.50.135:6800/3548272808,v1:192.168.50.135:6801/3548272808]
compat {c=[1],r=[1],i=[1]}]
[mds.ceph_fs.store6.fxgvoj{-1:152416137} state up:standby seq 1
join_fscid=1 addr
[v2:192.168.50.136:7024/1339959968,v1:192.168.50.136:7025/1339959968]
compat {c=[1],r=[1],i=[1]}]
[mds.ceph_fs.store4.mhvpot{-1:152477853} state up:standby seq 1
join_fscid=1 addr
[v2:192.168.50.134:6800/3098669884,v1:192.168.50.134:6801/3098669884]
compat {c=[1],r=[1],i=[1]}]
[mds.ceph_fs.store3.vcnwzh{-1:152481783} state up:standby seq 1
join_fscid=1 addr
[v2:192.168.50.133:6800/77378788,v1:192.168.50.133:6801/77378788] compat
{c=[1],r=[1],i=[1]}]
dumped fsmap epoch 62
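For reference, the daemon-level view of these standbys can be
cross-checked with the cephadm orchestrator; a short sketch using
standard commands (daemon and filesystem names taken from the dump
above):

  # list the MDS daemons as cephadm sees them, with their status
  ceph orch ps --daemon-type mds
  # summarize ranks and standbys for the filesystem
  ceph fs status ceph_fs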
Ceph status:
cluster:
id: 8c774934-1535-11ec-973e-525400130e4f
health: HEALTH_ERR
1 filesystem is degraded
1 filesystem has a failed mds daemon
1 filesystem is offline
26 daemons have recently crashed
services:
mon: 2 daemons, quorum cephadm-vm,store2 (age 2d)
mgr: store1.uevcpd(active, since 2d), standbys: cephadm-vm.zwagng
mds: 0/1 daemons up (1 failed), 4 standby
osd: 312 osds: 312 up (since 8h), 312 in (since 17h)
data:
volumes: 0/1 healthy, 1 failed
pools: 7 pools, 289 pgs
objects: 2.62M objects, 9.8 TiB
usage: 29 TiB used, 1.9 PiB / 1.9 PiB avail
pgs: 286 active+clean
3 active+clean+scrubbing+deep
io:
client: 945 KiB/s rd, 3.3 MiB/s wr, 516 op/s rd, 562 op/s wr
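Since this status was captured while an upgrade was in progress, it may
also be worth confirming what the orchestrator thinks the upgrade is
doing; a sketch using standard cephadm commands:

  # show state and progress of a running cephadm upgrade, if any
  ceph orch upgrade status
  # show which daemon versions are actually running per component
  ceph versions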
Ceph health detail:
HEALTH_ERR 1 filesystem is degraded; 1 filesystem has a failed mds
daemon; 1 filesystem is offline; 26 daemons have recently crashed
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs ceph_fs is degraded
[WRN] FS_WITH_FAILED_MDS: 1 filesystem has a failed mds daemon
fs ceph_fs has 1 failed mds
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
fs ceph_fs is offline because no MDS is active for it.
[WRN] RECENT_CRASH: 26 daemons have recently crashed
osd.323 crashed on host store7 at 2022-12-12T14:03:23.857874Z
osd.323 crashed on host store7 at 2022-12-12T14:03:43.945625Z
osd.323 crashed on host store7 at 2022-12-12T14:04:03.282797Z
osd.323 crashed on host store7 at 2022-12-12T14:04:22.612037Z
osd.323 crashed on host store7 at 2022-12-12T14:04:41.630473Z
osd.323 crashed on host store7 at 2022-12-12T14:34:49.237008Z
osd.323 crashed on host store7 at 2022-12-12T14:35:09.903922Z
osd.323 crashed on host store7 at 2022-12-12T14:35:28.621955Z
osd.323 crashed on host store7 at 2022-12-12T14:35:46.985517Z
osd.323 crashed on host store7 at 2022-12-12T14:36:05.375758Z
osd.323 crashed on host store7 at 2022-12-12T15:01:57.235785Z
osd.323 crashed on host store7 at 2022-12-12T15:02:16.581335Z
osd.323 crashed on host store7 at 2022-12-12T15:02:33.212653Z
osd.323 crashed on host store7 at 2022-12-12T15:02:49.775560Z
osd.323 crashed on host store7 at 2022-12-12T15:03:06.303861Z
mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
2022-12-13T13:21:41.149773Z
mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
2022-12-13T13:22:15.413105Z
mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
2022-12-13T13:23:39.888401Z
mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
2022-12-13T13:27:56.458529Z
mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
2022-12-13T13:31:03.791532Z
mgr.cephadm-vm.zwagng crashed on host cephadm-vm at
2022-12-13T13:34:24.023106Z
osd.98 crashed on host store3 at 2022-12-13T16:11:38.064735Z
mgr.store1.uevcpd crashed on host store1 at 2022-12-13T18:39:33.091261Z
osd.322 crashed on host store6 at 2022-12-14T06:06:14.193437Z
osd.234 crashed on host store8 at 2022-12-15T02:32:13.009795Z
osd.311 crashed on host store8 at 2022-12-15T02:32:18.407978Z
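The RECENT_CRASH entries above can be inspected and, once reviewed,
acknowledged so they no longer contribute to HEALTH_ERR; a sketch using
the standard crash module commands (<crash-id> is a placeholder):

  # list recent crashes with their IDs
  ceph crash ls
  # show backtrace and metadata for a single crash
  ceph crash info <crash-id>
  # acknowledge all reviewed crashes
  ceph crash archive-all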
As suggested, I was going to upgrade the Ceph cluster to 16.2.7 to fix
the MDS issue, but it seems none of the running standby daemons is
responding.
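For what it's worth, resuming the upgrade and kicking one of the idle
standbys would look roughly like this with cephadm (version taken from
the text above, daemon name from the dump; a sketch, not a verified
fix):

  # (re)start the upgrade to the target release
  ceph orch upgrade start --ceph-version 16.2.7
  # restart one of the standby MDS daemons, e.g. the one on store5
  ceph orch daemon restart mds.ceph_fs.store5.gnlqqm
  # tail recent cephadm log messages for errors
  ceph log last cephadm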
On 15.12.2022 at 19:08, Patrick Donnelly wrote:
On Thu, Dec 15, 2022 at 7:24 AM Mevludin Blazevic
<mblazevic@xxxxxxxxxxxxxx> wrote:
Hi,
while upgrading to Ceph Pacific 16.2.7, the upgrade process got stuck
exactly at the MDS daemons. Before that, I tried to increase/shrink
their placement size, but nothing happened. Currently I have 4/3
running daemons; one daemon should be stopped and removed.
Do you suggest force-removing these daemons, or what would be the
preferred workaround?
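(For context: shrinking the MDS placement or force-removing a single
daemon would normally go through the orchestrator, roughly as below;
the daemon name and placement count are placeholders, not a
recommendation:)

  # reduce the number of MDS daemons the orchestrator maintains
  ceph orch apply mds ceph_fs --placement=3
  # or force-remove one specific daemon
  ceph orch daemon rm mds.ceph_fs.store6.fxgvoj --force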
Hard to say without more information. Please share:
ceph fs dump
ceph status
ceph health detail
--
Mevludin Blazevic, M.Sc.
University of Koblenz-Landau
Computing Centre (GHRKO)
Universitaetsstrasse 1
D-56070 Koblenz, Germany
Room A023
Tel: +49 261/287-1326