Hi list (and cephfs devs :-)),

On 2020-04-29 17:43, Jake Grimmett wrote:
> ...the "mdsmap_decode" errors stopped suddenly on all our clients...
>
> Not exactly sure what the problem was, but restarting our standby mds
> daemons seems to have been the fix.
>
> Here's the log on the standby mds exactly when the errors stopped:
>
> 2020-04-29 15:41:22.944 7f3d04e06700  1 mds.ceph-s2 Map has assigned me to become a standby
> 2020-04-29 15:43:05.621 7f3d04e06700  1 mds.ceph-s2 Updating MDS map to version 394712 from mon.0
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 handle_mds_map i am now mds.34541673.0 replaying mds.0.0
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 replay_start
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 recovery set is
> 2020-04-29 15:43:05.655 7f3cfe5f9700  0 mds.0.cache creating system inode with ino:0x100
> 2020-04-29 15:43:05.656 7f3cfe5f9700  0 mds.0.cache creating system inode with ino:0x1

So, we got some HEALTH_WARN on our cluster because of this issue.

Cluster: 13.2.8
Client: cephfs kernel client 5.7.9-050709-generic with 13.2.10 (Ubuntu 18.04)

The standby mds, and only the standby, is logging about this:

> 2020-08-27 06:25:01.086 7efc10cad700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 21705) UID: 0
> 2020-08-27 08:42:25.340 7efc0d2be700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.497840 secs
> 2020-08-27 08:42:25.340 7efc0d2be700  0 log_channel(cluster) log [WRN] : slow request 30.497839 seconds old, received at 2020-08-27 08:41:54.847218: client_request(client.133487514:37390263 getattr AsLsXsFs #0x10050572c4e 2020-08-27 08:41:54.840824 caller_uid=3860, caller_gid=3860{}) currently failed to rdlock, waiting
> 2020-08-27 11:06:55.492 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 64.583827 seconds ago
> 2020-08-27 11:07:55.502 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 124.593098 seconds ago
> 2020-08-27 11:09:55.561 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 244.651434 seconds ago
> 2020-08-27 11:13:55.505 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 484.596083 seconds ago
> 2020-08-27 11:21:55.500 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 964.592686 seconds ago

On the clients we get the "mdsmap_decode got incorrect state(up:standby-replay)" messages at exactly the times mds2 logs these warnings. There is no such logging on the active mds, which is the exact opposite of what I would expect. Why is the standby mds logging this?

Sometimes the "client.$id isn't responding to mclientcaps(revoke)" warnings resolve themselves, but it can also take a considerable amount of time. I could of course restart the standby mds ... but that's not my first choice. If this is a software defect, I would like to get it fixed.
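For anyone hitting the same warnings, a rough sketch of what can be inspected without restarting anything. The mds name "ceph-s2" and the client id 134430768 below are only placeholders taken from the logs above, substitute your own; the systemd unit name assumes a plain package-based deployment:

    # on the host running the standby(-replay) mds, via its admin socket:
    ceph daemon mds.ceph-s2 session ls          # list client sessions, look up client.134430768
    ceph daemon mds.ceph-s2 dump_ops_in_flight  # show the requests behind the slow-request warning

    # cluster-wide view of which daemon is active and which is standby-replay:
    ceph fs status

    # last resort: restart only the standby daemon, the active mds keeps serving
    systemctl restart ceph-mds@ceph-s2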
Gr. Stefan