On Wed, Apr 21, 2021 at 7:39 AM Flemming Frandsen <dren.dk@xxxxxxxxx> wrote:
>
> I tried restarting an MDS server using:
> systemctl restart ceph-mds@ardmore.service
>
> This caused the standby server to enter replay state and the fs
> started hanging for several minutes.
>
> In a slight panic I restarted the other mds server, which was
> replaced by the standby server and it almost immediately entered
> resolve state.

While restarting a service/machine is a reasonable practice for a
laptop, please resist the urge to do this in a distributed system. You
may multiply your problems.

> fs dump shows a seq number counting upwards very slowly for the
> replaying MDS server, I have no idea how far it needs to count:

This is a normal heartbeat sequence number. Nothing to be concerned
about.

> # ceph fs dump
>
> dumped fsmap epoch 1030314
> e1030314
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 1
>
> Filesystem 'cephfs' (1)
> fs_name cephfs
> epoch   1030314
> flags   12
> created 2019-09-09 13:08:26.830927
> modified        2021-04-21 14:04:14.672440
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> min_compat_client       -1 (unspecified)
> last_failure    0
> last_failure_osd_epoch  13610
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 2
> in      0,1
> up      {0=10398946,1=10404857}
> failed
> damaged
> stopped
> data_pools      [1]
> metadata_pool   2
> inline_data     disabled
> balancer
> standby_count_wanted    1
> [mds.dalmore{0:10398946} state up:replay seq 215 addr [v2:10.0.37.222:6800/2681188441,v1:10.0.37.222:6801/2681188441]]
> [mds.cragganmore{1:10404857} state up:resolve seq 201 addr [v2:10.0.37.221:6800/871249119,v1:10.0.37.221:6801/871249119]]
>
> Standby daemons:
>
> [mds.ardmore{-1:10408652} state up:standby seq 2 addr [v2:10.0.37.223:6800/4096598841,v1:10.0.37.223:6801/4096598841]]
>
> Earlier today we added a new OSD host with 12 new OSDs and
> backfilling is proceeding as expected:
>
>   cluster:
>     id:     e2007417-a346-4af7-8aa9-4ce8f0d73661
>     health: HEALTH_WARN
>             1 filesystem is degraded
>             1 MDSs behind on trimming
>
>   services:
>     mon: 3 daemons, quorum cragganmore,dalmore,ardmore (age 5w)
>     mgr: ardmore(active, since 2w), standbys: dalmore, cragganmore
>     mds: cephfs:2/2 {0=dalmore=up:replay,1=cragganmore=up:resolve} 1 up:standby
>     osd: 69 osds: 69 up (since 102m), 69 in (since 102m); 443 remapped pgs
>     rgw: 9 daemons active (ardmore.rgw0, ardmore.rgw1, ardmore.rgw2,
>          cragganmore.rgw0, cragganmore.rgw1, cragganmore.rgw2,
>          dalmore.rgw0, dalmore.rgw1, dalmore.rgw2)
>
>   task status:
>     scrub status:
>         mds.cragganmore: idle
>         mds.dalmore: idle
>
>   data:
>     pools:   13 pools, 1440 pgs
>     objects: 50.57M objects, 9.0 TiB
>     usage:   34 TiB used, 37 TiB / 71 TiB avail
>     pgs:     30195420/151707033 objects misplaced (19.904%)
>              997 active+clean
>              431 active+remapped+backfill_wait
>              12  active+remapped+backfilling
>
>   io:
>     client:   65 MiB/s rd, 206 KiB/s wr, 17 op/s rd, 8 op/s wr
>     recovery: 5.5 MiB/s, 23 objects/s
>
>   progress:
>     Rebalancing after osd.62 marked in
>       [======================........]
>     Rebalancing after osd.67 marked in
>       [===========...................]
>     Rebalancing after osd.68 marked in
>       [============..................]
>     Rebalancing after osd.64 marked in
>       [=====================.........]
>     Rebalancing after osd.60 marked in
>       [====================..........]
>     Rebalancing after osd.66 marked in
>       [=============.................]
>     Rebalancing after osd.63 marked in
>       [=====================.........]
>     Rebalancing after osd.61 marked in
>       [======================........]
>     Rebalancing after osd.59 marked in
>       [======================........]
>     Rebalancing after osd.58 marked in
>       [========================......]
>     Rebalancing after osd.57 marked in
>       [===========================...]
>     Rebalancing after osd.65 marked in
>       [==================............]
>
> It seems we're running a mix of versions:
>
> ceph versions
> {
>     "mon": {
>         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 3
>     },
>     "mgr": {
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
>     },
>     "osd": {
>         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 57,
>         "ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)": 12
>     },
>     "mds": {
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 3
>     },
>     "rgw": {
>         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 9
>     },
>     "overall": {
>         "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9) nautilus (stable)": 69,
>         "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11) nautilus (stable)": 6,
>         "ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0) nautilus (stable)": 12
>     }
> }
>
> Any hints will be greatly appreciated.

It's probably because you have a very large journal (hence the "1 MDSs
behind on trimming" warning in your status output), so up:replay has a
lot of events to work through. Did you make any configuration changes
to the MDS? You simply need to wait for the up:replay daemon to
finish.
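If you want a rough sense of how far along replay is, the journal perf
counters on the replaying daemon's admin socket are more informative
than the heartbeat seq. Something like this, run on the host carrying
mds.dalmore (your rank 0):

  # rdpos should climb toward wrpos as replay works through the
  # journal; expos is the expire (trim) position it started from.
  ceph daemon mds.dalmore perf dump mds_log | grep -E '"(rdpos|wrpos|expos)"'

The gap between wrpos and rdpos is roughly how much journal is left to
replay.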
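As for configuration changes: the usual suspect for an oversized
journal is mds_log_max_segments having been raised from its default
(128 on Nautilus, if I recall correctly). You can check what the
daemon is actually running with:

  # dump the journal/trimming settings currently in effect
  ceph daemon mds.dalmore config show | grep mds_log_max

A value far above the default there would explain both the trimming
warning and the long replay.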
--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx