MDS stuck in replay

Magnus HAGDORN <Magnus.Hagdorn@xxxxxxxx> · Tue, 31 May 2022 07:41:42 +0000

Hi all,
it seems to be the season for stuck MDSs. Our Ceph filesystem is also
degraded: one MDS has been stuck in replay for about 20 hours now.

We run a Nautilus Ceph cluster with about 300 TB of data and many
millions of files. We run two MDSs, with a particularly large directory
pinned to one of them. Both MDSs have standby MDSs.

We are in the process of migrating to a new Pacific cluster and have
been syncing files daily. Over the weekend something happened and we
ended up with slow MDS responses, and some directories became very slow
(as we'd expect). We restarted the second MDS. It came back within a
minute and the problem disappeared for a little while. Then the slow MDS
operations came back and we restarted the other MDS. This one has been
in replay state since yesterday.

The cluster is healthy.

So we are wondering what it is up to, how long it might take, and
whether there is anything we can do to speed up the replay phase.

Regards
magnus