Hi list (and cephfs devs :-)),

On 2020-04-29 17:43, Jake Grimmett wrote:
> ...the "mdsmap_decode" errors stopped suddenly on all our clients...
>
> Not exactly sure what the problem was, but restarting our standby mds
> daemons seems to have been the fix.
>
> Here's the log on the standby mds exactly when the errors stopped:
>
> 2020-04-29 15:41:22.944 7f3d04e06700  1 mds.ceph-s2 Map has assigned me to become a standby
> 2020-04-29 15:43:05.621 7f3d04e06700  1 mds.ceph-s2 Updating MDS map to version 394712 from mon.0
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 handle_mds_map i am now mds.34541673.0 replaying mds.0.0
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 handle_mds_map state change up:boot --> up:standby-replay
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 replay_start
> 2020-04-29 15:43:05.623 7f3d04e06700  1 mds.0.0 recovery set is
> 2020-04-29 15:43:05.655 7f3cfe5f9700  0 mds.0.cache creating system inode with ino:0x100
> 2020-04-29 15:43:05.656 7f3cfe5f9700  0 mds.0.cache creating system inode with ino:0x1

So, we got some HEALTH_WARN on our cluster because of this issue.

Cluster: 13.2.8
Client: cephfs kernel client 5.7.9-050709-generic with 13.2.10 (Ubuntu 18.04)

The standby mds, and only the standby, is logging about this:

> 2020-08-27 06:25:01.086 7efc10cad700 -1 received signal: Hangup from pkill -1 -x ceph-mon|ceph-mgr|ceph-mds|ceph-osd|ceph-fuse|radosgw (PID: 21705) UID: 0
> 2020-08-27 08:42:25.340 7efc0d2be700  0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.497840 secs
> 2020-08-27 08:42:25.340 7efc0d2be700  0 log_channel(cluster) log [WRN] : slow request 30.497839 seconds old, received at 2020-08-27 08:41:54.847218: client_request(client.133487514:37390263 getattr AsLsXsFs #0x10050572c4e 2020-08-27 08:41:54.840824 caller_uid=3860, caller_gid=3860{}) currently failed to rdlock, waiting
> 2020-08-27 11:06:55.492 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 64.583827 seconds ago
> 2020-08-27 11:07:55.502 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 124.593098 seconds ago
> 2020-08-27 11:09:55.561 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 244.651434 seconds ago
> 2020-08-27 11:13:55.505 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 484.596083 seconds ago
> 2020-08-27 11:21:55.500 7efc0d2be700  0 log_channel(cluster) log [WRN] : client.134430768 isn't responding to mclientcaps(revoke), ino 0x1005081be30 pending pAsLsXsFscr issued pAsLsXsFscr, sent 964.592686 seconds ago

On the clients we get the "mdsmap_decode got incorrect state(up:standby-replay)" messages at exactly the times mds2 logs these warnings. There is no such logging on the active mds, which is the exact opposite of what I would expect. Why is the standby mds logging this?

Sometimes the "client.$id isn't responding to mclientcaps(revoke)" warnings resolve themselves, but it can also take a considerable amount of time. I could of course restart the standby mds ... but that's not my first choice. If this is a software defect, I would like to get it fixed.
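For anyone hitting the same warnings, a rough sketch of what can be inspected without restarting anything. The mds name "ceph-s2" and the client id 134430768 below are only placeholders taken from the logs above, substitute your own; the systemd unit name assumes a plain package-based deployment:

    # on the host running the standby(-replay) mds, via its admin socket:
    ceph daemon mds.ceph-s2 session ls          # list client sessions, look up client.134430768
    ceph daemon mds.ceph-s2 dump_ops_in_flight  # show the requests behind the slow-request warning

    # cluster-wide view of which daemon is active and which is standby-replay:
    ceph fs status

    # last resort: restart only the standby daemon, the active mds keeps serving
    systemctl restart ceph-mds@ceph-s2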
Gr. Stefan