I tried restarting an MDS server using:

  systemctl restart ceph-mds@ardmore.service

This caused the standby server to enter the replay state, and the fs started hanging for several minutes. In a slight panic I restarted the other MDS server, which was replaced by the standby, and that one almost immediately entered the resolve state.

fs dump shows a seq number counting upwards very slowly for the replaying MDS server; I have no idea how far it needs to count:

# ceph fs dump
dumped fsmap epoch 1030314
e1030314
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1

Filesystem 'cephfs' (1)
fs_name cephfs
epoch   1030314
flags   12
created 2019-09-09 13:08:26.830927
modified        2021-04-21 14:04:14.672440
tableserver     0
root    0
session_timeout 60
session_autoclose       300
max_file_size   1099511627776
min_compat_client       -1 (unspecified)
last_failure    0
last_failure_osd_epoch  13610
compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 2
in      0,1
up      {0=10398946,1=10404857}
failed
damaged
stopped
data_pools      [1]
metadata_pool   2
inline_data     disabled
balancer
standby_count_wanted    1
[mds.dalmore{0:10398946} state up:replay seq 215 addr [v2:10.0.37.222:6800/2681188441,v1:10.0.37.222:6801/2681188441]]
[mds.cragganmore{1:10404857} state up:resolve seq 201 addr [v2:10.0.37.221:6800/871249119,v1:10.0.37.221:6801/871249119]]

Standby daemons:

[mds.ardmore{-1:10408652} state up:standby seq 2 addr [v2:10.0.37.223:6800/4096598841,v1:10.0.37.223:6801/4096598841]]

Earlier today we added a new OSD host with 12 new OSDs, and backfilling is proceeding as expected:

  cluster:
    id:     e2007417-a346-4af7-8aa9-4ce8f0d73661
    health: HEALTH_WARN
            1 filesystem is degraded
            1 MDSs behind on trimming

  services:
    mon: 3 daemons, quorum cragganmore,dalmore,ardmore (age 5w)
    mgr: ardmore(active, since 2w), standbys: dalmore, cragganmore
    mds: cephfs:2/2 {0=dalmore=up:replay,1=cragganmore=up:resolve} 1 up:standby
    osd: 69 osds: 69 up (since 102m), 69 in (since 102m); 443 remapped pgs
    rgw: 9 daemons active (ardmore.rgw0, ardmore.rgw1, ardmore.rgw2, cragganmore.rgw0, cragganmore.rgw1, cragganmore.rgw2, dalmore.rgw0, dalmore.rgw1, dalmore.rgw2)

  task status:
    scrub status:
        mds.cragganmore: idle
        mds.dalmore: idle

  data:
    pools:   13 pools, 1440 pgs
    objects: 50.57M objects, 9.0 TiB
    usage:   34 TiB used, 37 TiB / 71 TiB avail
    pgs:     30195420/151707033 objects misplaced (19.904%)
             997 active+clean
             431 active+remapped+backfill_wait
             12  active+remapped+backfilling

  io:
    client:   65 MiB/s rd, 206 KiB/s wr, 17 op/s rd, 8 op/s wr
    recovery: 5.5 MiB/s, 23 objects/s

  progress:
    Rebalancing after osd.62 marked in
      [======================........]
    Rebalancing after osd.67 marked in
      [===========...................]
    Rebalancing after osd.68 marked in
      [============..................]
    Rebalancing after osd.64 marked in
      [=====================.........]
    Rebalancing after osd.60 marked in
      [====================..........]
    Rebalancing after osd.66 marked in
      [=============.................]
    Rebalancing after osd.63 marked in
      [=====================.........]
    Rebalancing after osd.61 marked in
      [======================........]
    Rebalancing after osd.59 marked in
      [======================........]
    Rebalancing after osd.58 marked in
      [========================......]
    Rebalancing after osd.57 marked in
      [===========================...]
    Rebalancing after osd.65 marked in
      [==================............]
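A minimal sketch of how the replay position could be watched over the MDS admin socket, in case that is the right thing to look at (this assumes the mds_log perf section is present under that name in 14.2 and that its position counters, e.g. rdpos/wrpos, reflect journal replay progress; the counter names are my guess, so the sketch just prints the whole section):

#!/bin/sh
# Poll the mds_log perf counters on the replaying MDS every 10 seconds.
# Run on the host where mds.dalmore runs (needs access to its admin socket).
while true; do
    date
    ceph daemon mds.dalmore perf dump \
        | python3 -c 'import json, sys; print(json.dumps(json.load(sys.stdin).get("mds_log", {}), indent=2))'
    sleep 10
done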
It seems we're running a mix of versions:

# ceph versions
{
    "mon": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3
    },
    "mgr": {
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
    },
    "osd": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 57,
        "ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)": 12
    },
    "mds": {
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
    },
    "rgw": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 9
    },
    "overall": {
        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 69,
        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 6,
        "ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)": 12
    }
}

Any hints will be greatly appreciated.

-- 
Flemming Frandsen - YAPH - http://osaa.dk - http://dren.dk/