Replay MDS server stuck

We had an outage of our Jewel 10.2.11 CephFS last night. Our primary MDS hit an assert in Ceph's try_remove_dentries_for_stray(), but the standby-replay MDS never came up as active. The logs for MDS02 show:

---like clockwork, these first two lines appear every second---
2019-08-02 16:27:24.664508 7f6f47f5c700  1 mds.0.0 standby_replay_restart (as standby)
2019-08-02 16:27:24.689079 7f6f43341700  1 mds.0.0 replay_done (as standby)
2019-08-02 16:27:25.689210 7f6f47f5c700  1 mds.0.0 standby_replay_restart (as standby)
2019-08-09 00:01:25.298131 7f6f4a862700  1 mds.0.555510 handle_mds_map i am now mds.0.555510
2019-08-09 00:01:25.298135 7f6f4a862700  1 mds.0.555510 handle_mds_map state change up:standby-replay --> up:replay
2019-08-09 00:43:35.382921 7f6f46f5a700 -1 mds.sun-gcs01-mds02 *** got signal Terminated ***
2019-08-09 00:43:35.382952 7f6f46f5a700  1 mds.sun-gcs01-mds02 suicide.  wanted state up:replay
2019-08-09 00:43:35.409663 7f6f46f5a700  1 mds.0.555510 shutdown: shutting down rank 0
2019-08-09 00:43:35.414729 7f6f43341700  0 mds.0.log _replay journaler got error -11, aborting
2019-08-09 00:48:36.819539 7fe6e89e8200  0 set uid:gid to 1001:1002 (ceph:ceph)
2019-08-09 00:48:36.819555 7fe6e89e8200  0 ceph version 10.2.11 (e4b061b47f07f583c92a050d9e84b1813a35671e), process ceph-mds, pid 39603
2019-08-09 00:48:36.820045 7fe6e89e8200  0 pidfile_write: ignore empty --pid-file
2019-08-09 00:48:37.813857 7fe6e2088700  1 mds.sun-gcs01-mds02 handle_mds_map standby
2019-08-09 00:48:37.833089 7fe6e2088700  1 mds.0.0 handle_mds_map i am now mds.19317235.0 replaying mds.0.0
2019-08-09 00:48:37.833097 7fe6e2088700  1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
2019-08-09 00:48:37.833106 7fe6e2088700  1 mds.0.0 replay_start
2019-08-09 00:48:37.833111 7fe6e2088700  1 mds.0.0  recovery set is
2019-08-09 00:48:37.849332 7fe6dd77e700  0 mds.0.cache creating system inode with ino:100
2019-08-09 00:48:37.849627 7fe6dd77e700  0 mds.0.cache creating system inode with ino:1
2019-08-09 00:48:40.548094 7fe6dab67700  0 log_channel(cluster) log [WRN] :  replayed op client.10012302:8321663,8321660 used ino 10052d9c287 but session next is 10052d57512
2019-08-09 00:48:40.844534 7fe6dab67700  1 mds.0.0 replay_done (as standby)
2019-08-09 00:48:41.844648 7fe6df782700  1 mds.0.0 standby_replay_restart (as standby)
2019-08-09 00:48:41.868242 7fe6dab67700  1 mds.0.0 replay_done (as standby)
---the last two lines repeat every second---

I was thinking of a couple of options to improve recovery in this situation.

1. Monitor the replay status of the standby-replay MDS and alert if it falls too far behind the active MDS, or if its replay position stops advancing. I'm looking at `ceph daemon mds.<id> perf dump`, since mds_log:rdpos was mentioned in another thread, but the replay MDS reports a higher number than the primary, so comparing the two directly doesn't seem reliable. Ideally there would be a timestamp, or some entry that increments with time, that we could check (a rough sketch of the kind of check I mean follows after this list).

2. A third MDS that isn't replaying the journal and acts as more of a cold standby. It would take longer to become active, but it seems that if the standby-replay MDS failed to take over, the cold standby would still read the journal from scratch and come up.
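
Something along these lines is what I have in mind for option 1. It's only a rough sketch: it checks whether the local standby-replay daemon's mds_log rdpos is moving at all rather than comparing it against the active MDS, and the MDS name and interval are just placeholders:

#!/usr/bin/env python
# Rough sketch: sample mds_log.rdpos from the local MDS admin socket twice
# and complain if the replay position isn't advancing. Assumes the ceph CLI
# and the admin socket are available on the MDS host; the mds_log/rdpos
# counter names are what perf dump shows on our Jewel MDS, so adjust as
# needed for your version.
import json
import subprocess
import sys
import time

MDS_NAME = "sun-gcs01-mds02"   # local MDS name, placeholder
INTERVAL = 60                  # seconds between the two samples

def rdpos(name):
    out = subprocess.check_output(
        ["ceph", "daemon", "mds." + name, "perf", "dump"])
    return json.loads(out.decode())["mds_log"]["rdpos"]

first = rdpos(MDS_NAME)
time.sleep(INTERVAL)
second = rdpos(MDS_NAME)

# Note: on an idle filesystem rdpos can legitimately stay flat, so this is
# only a starting point, not a reliable health check on its own.
if second <= first:
    print("replay looks stuck: rdpos %d -> %d over %ds" % (first, second, INTERVAL))
    sys.exit(1)

print("replay advancing: rdpos %d -> %d" % (first, second))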

It seems that the issue that caused the primary to fail is hard to catch and debug, so we just need to be sure that we can recover with failover. Any hints on how to improve that would be appreciated.

Thank you,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
