Re: MDS replay takes forever and cephfs is down

On Wed, Apr 21, 2021 at 7:39 AM Flemming Frandsen <dren.dk@xxxxxxxxx> wrote:
>
> I tried restarting an MDS server using: systemctl restart
> ceph-mds@ardmore.service
>
> This caused the standby server to enter the replay state, and the fs
> has been hanging for several minutes.
>
> In a slight panic I restarted the other MDS server; it was replaced by
> the standby server, which almost immediately entered the resolve state.

While restarting a service/machine is a reasonable practice for a
laptop, please resist the urge to do this in a distributed system. You
may multiply your problems.
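
For example, before restarting an active MDS it is safer to confirm that
a standby exists and to let each failover finish before touching the
next daemon. A minimal sketch (the daemon name is taken from your fs
dump):

  # confirm a standby is available to take over
  ceph fs status
  # restart exactly one MDS, then watch until its replacement is up:active
  systemctl restart ceph-mds@ardmore.service
  ceph -w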

> fs dump shows a seq number counting upwards very slowly for the
> replaying MDS server; I have no idea how far it needs to count:

This is a normal heartbeat sequence number. Nothing to be concerned about.
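
If you want a better sense of actual replay progress, one option (a
sketch; run it on the node hosting the up:replay daemon) is to compare
the journal read position against the write position in the mds_log
perf counters:

  # on the host of the replaying daemon (dalmore here)
  ceph daemon mds.dalmore perf dump mds_log
  # replay has finished reading the journal once "rdpos" reaches "wrpos"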

> # ceph fs dump
>
> dumped fsmap epoch 1030314
> e1030314
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 1
>
> Filesystem 'cephfs' (1)
> fs_name cephfs
> epoch   1030314
> flags   12
> created 2019-09-09 13:08:26.830927
> modified        2021-04-21 14:04:14.672440
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> min_compat_client       -1 (unspecified)
> last_failure    0
> last_failure_osd_epoch  13610
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 2
> in      0,1
> up      {0=10398946,1=10404857}
> failed
> damaged
> stopped
> data_pools      [1]
> metadata_pool   2
> inline_data     disabled
> balancer
> standby_count_wanted    1
> [mds.dalmore{0:10398946} state up:replay seq 215 addr [v2:
> 10.0.37.222:6800/2681188441,v1:10.0.37.222:6801/2681188441]]
> [mds.cragganmore{1:10404857} state up:resolve seq 201 addr [v2:
> 10.0.37.221:6800/871249119,v1:10.0.37.221:6801/871249119]]
>
>
> Standby daemons:
>
> [mds.ardmore{-1:10408652} state up:standby seq 2 addr [v2:
> 10.0.37.223:6800/4096598841,v1:10.0.37.223:6801/4096598841]]
>
>
> Earlier today we added a new OSD host with 12 new OSDs and backfilling is
> proceeding as expected:
>
>  cluster:
>    id:     e2007417-a346-4af7-8aa9-4ce8f0d73661
>    health: HEALTH_WARN
>            1 filesystem is degraded
>            1 MDSs behind on trimming
>
>  services:
>    mon: 3 daemons, quorum cragganmore,dalmore,ardmore (age 5w)
>    mgr: ardmore(active, since 2w), standbys: dalmore, cragganmore
>    mds: cephfs:2/2 {0=dalmore=up:replay,1=cragganmore=up:resolve} 1
> up:standby
>    osd: 69 osds: 69 up (since 102m), 69 in (since 102m); 443 remapped pgs
>
>    rgw: 9 daemons active (ardmore.rgw0, ardmore.rgw1, ardmore.rgw2,
> cragganmore.rgw0, cragganmore.rgw1, cragganmore.rgw2, dalmore.rgw0,
> dalmore.rgw1, dalmore.rgw2)
>
>  task status:
>    scrub status:
>        mds.cragganmore: idle
>        mds.dalmore: idle
>
>  data:
>    pools:   13 pools, 1440 pgs
>    objects: 50.57M objects, 9.0 TiB
>    usage:   34 TiB used, 37 TiB / 71 TiB avail
>    pgs:     30195420/151707033 objects misplaced (19.904%)
>             997 active+clean
>             431 active+remapped+backfill_wait
>             12  active+remapped+backfilling
>
>  io:
>    client:   65 MiB/s rd, 206 KiB/s wr, 17 op/s rd, 8 op/s wr
>    recovery: 5.5 MiB/s, 23 objects/s
>
>  progress:
>    Rebalancing after osd.62 marked in
>      [======================........]
>    Rebalancing after osd.67 marked in
>      [===========...................]
>    Rebalancing after osd.68 marked in
>      [============..................]
>    Rebalancing after osd.64 marked in
>      [=====================.........]
>    Rebalancing after osd.60 marked in
>      [====================..........]
>    Rebalancing after osd.66 marked in
>      [=============.................]
>    Rebalancing after osd.63 marked in
>      [=====================.........]
>    Rebalancing after osd.61 marked in
>      [======================........]
>    Rebalancing after osd.59 marked in
>      [======================........]
>    Rebalancing after osd.58 marked in
>      [========================......]
>    Rebalancing after osd.57 marked in
>      [===========================...]
>    Rebalancing after osd.65 marked in
>      [==================............]
>
>
> It seems we're running a mix of versions:
>
> ceph versions
> {
>    "mon": {
>        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)
> nautilus (stable)": 3
>    },
>    "mgr": {
>        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11)
> nautilus (stable)": 3
>    },
>    "osd": {
>        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)
> nautilus (stable)": 57,
>        "ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0)
> nautilus (stable)": 12
>    },
>    "mds": {
>        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11)
> nautilus (stable)": 3
>    },
>    "rgw": {
>        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)
> nautilus (stable)": 9
>    },
>    "overall": {
>        "ceph version 14.2.18 (befbc92f3c11eedd8626487211d200c0b44786d9)
> nautilus (stable)": 69,
>        "ceph version 14.2.19 (bb796b9b5bab9463106022eef406373182465d11)
> nautilus (stable)": 6,
>        "ceph version 14.2.20 (36274af6eb7f2a5055f2d53ad448f2694e9046a0)
> nautilus (stable)": 12
>    }
> }
>
> Any hints will be greatly appreciated.

The likely cause is a very large journal (note the "behind on trimming"
warning in your status output). Did you make any configuration changes
to the MDS? You simply need to wait for the up:replay daemon to finish.
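
To gauge how far behind trimming is, one option (a sketch, assuming the
default settings) is to compare the live journal segment count against
the configured limit:

  # current number of journal segments ("seg" in the output)
  ceph daemon mds.cragganmore perf dump mds_log
  # the limit that triggers the trimming warning (default 128)
  ceph config get mds mds_log_max_segments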

-- 
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D