On Tue, Sep 22, 2020 at 2:59 AM <heilig.oleg@xxxxxxxxx> wrote:
>
> Hi there,
>
> We have a 9-node Ceph cluster running version 15.2.5. The cluster has 175 OSDs (HDD) plus 3 NVMe OSDs used as a cache tier for the "cephfs_data" pool. CephFS pool info:
>
> POOL             ID  STORED   OBJECTS  USED     %USED  MAX AVAIL
> cephfs_data       1  350 TiB  179.53M  350 TiB  66.93     87 TiB
> cephfs_metadata   3  3.1 TiB   17.69M  3.1 TiB   1.77     87 TiB
>
> We use multiple MDS instances: 3 "active" and 3 "standby". Each MDS server has 128 GB RAM, with "mds cache memory limit" = 64 GB.
>
> Failover to a standby MDS instance takes 10-15 hours! CephFS is unreachable for clients that entire time; the MDS instance just stays in the "up:replay" state.
> It looks like the MDS daemon is checking all of the folders:
>
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EOpen.replay
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay 3 dirlumps by unknown.0
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay dir 0x300000041c5
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x300000041c5 /repository/files/14/ [2,head] auth v=2070324 cv=0/0 state=1610612737|complete f(v0 m2020-09-10T13:05:29.297254-0700 515=0+515) n(v46584 rc2020-09-21T20:38:49.071043-0700 b3937793650802 1056114=601470+454644) hs=515+0,ss=0+0 dirty=75 | child=1 subtree=0 dirty=1 0x55d4c9359b80]
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay for [2,head] had [dentry #0x1/repository/files/14/14119 [2,head] auth (dversion lock) v=2049516 ino=0x30000812e2f state=1073741824 | inodepin=1 0x55db2463a1c0]
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay for [2,head] had [inode 0x30000812e2f [...2,head] /repository/files/14/14119/ auth fragtree_t(*^3) v2049516 f(v0 m2020-09-18T10:17:53.379121-0700 13498=0+13498) n(v6535 rc2020-09-19T05:52:25.035403-0700 b272027384385 112669=81992+30677) (iversion lock) | dirfrag=8 0x55db24643000]
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay dir 0x30000812e2f.000*
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x30000812e2f.000* /repository/files/14/14119/ [2,head] auth v=77082 cv=0/0 state=1073741824 f(v0 m2020-09-18T10:17:53.371122-0700 1636=0+1636) n(v6535 rc2020-09-19T05:51:18.063949-0700 b33321023818 13707=9986+3721) hs=885+0,ss=0+0 | child=1 0x55db845bf080]
> 2020-09-22T02:43:44.406-0700 7f22ae99e700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/repository/files/14/14119/39823 [2,head] auth NULL (dversion lock) v=0 ino=(nil) state=1073741888|bottomlru 0x55d82061a900]
>
> We tried standby-replay, and it helps, but it doesn't eliminate the root cause.
> We have millions of folders with millions of small files. Once the folder/subfolder scan is done, CephFS is active again. I believe 10 hours of downtime is unexpected behaviour. Is there any way to force the MDS to change its status to active and run all of the required directory checks in the background? How can I localise the root cause?

Link to a tracker issue where some discussion has taken place:
https://tracker.ceph.com/issues/47582

--
Patrick Donnelly, Ph.D.
He / Him / His
Principal Software Engineer
Red Hat Sunnyvale, CA
GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
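
A minimal sketch of the CLI steps discussed in this thread, for anyone reproducing the setup. It assumes the filesystem is named "cephfs" (not stated in the report) and reuses the 64 GiB cache figure quoted above; adjust both to your cluster:

  # Keep a standby MDS continuously tailing the active daemon's journal,
  # so most of the replay work is already done when it takes over:
  ceph fs set cephfs allow_standby_replay true

  # The "mds cache memory limit" mentioned above maps to this config key
  # (value in bytes; 68719476736 = 64 GiB):
  ceph config set mds mds_cache_memory_limit 68719476736

  # Watch MDS ranks and states (up:replay, up:standby-replay, up:active)
  # while a failover is in progress:
  ceph fs status
  ceph mds stat

Standby-replay means the follower has already replayed most of the journal before it is promoted, which is consistent with the observation above that it helps without removing the underlying cause.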