dmesg: mdsc_handle_reply got x on session mds1 not mds0

胡玮文 <sehuww@xxxxxxxxxxxxxxxx> · Mon, 10 Jan 2022 02:17:11 +0800 (GMT+08:00)

Hi ceph developers,

Today one of our OSD hosts hung because of OOM. Some OSDs were flapping and were eventually marked down and out. The resulting recovery caused one OSD to become full; that OSD is used by both the CephFS metadata and data pools.

The strange things are:
* Many of our users reported unexpected “Permission denied” errors when creating new files.
* dmesg shows some strange errors (see examples below). During that time, neither of the two active MDSes logged anything unusual.
* Both of the above happened BEFORE the OSD became full.

Jan 09 01:27:13 gpu027 kernel: libceph: osd9 up
Jan 09 01:27:13 gpu027 kernel: libceph: osd10 up
Jan 09 01:28:55 gpu027 kernel: libceph: osd9 down
Jan 09 01:28:55 gpu027 kernel: libceph: osd10 down
Jan 09 01:32:35 gpu027 kernel: libceph: osd6 weight 0x0 (out)
Jan 09 01:32:35 gpu027 kernel: libceph: osd16 weight 0x0 (out)
Jan 09 01:34:18 gpu027 kernel: libceph: osd1 weight 0x0 (out)
Jan 09 01:39:20 gpu027 kernel: libceph: osd9 weight 0x0 (out)
Jan 09 01:39:20 gpu027 kernel: libceph: osd10 weight 0x0 (out)
Jan 09 01:53:07 gpu027 kernel: ceph: mdsc_handle_reply got 30408991 on session mds1 not mds0
Jan 09 01:53:14 gpu027 kernel: ceph: mdsc_handle_reply got 30409829 on session mds1 not mds0
Jan 09 01:53:15 gpu027 kernel: ceph: mdsc_handle_reply got 30409925 on session mds1 not mds0
Jan 09 01:53:28 gpu027 kernel: ceph: mdsc_handle_reply got 30411416 on session mds1 not mds0
Jan 09 02:05:07 gpu027 kernel: ceph: mdsc_handle_reply got 30417742 on session mds0 not mds1
Jan 09 02:48:52 gpu027 kernel: ceph: mdsc_handle_reply got 30449177 on session mds1 not mds0
Jan 09 02:49:17 gpu027 kernel: ceph: mdsc_handle_reply got 30452750 on session mds1 not mds0

From reading the code, it looks like these replies are treated as unexpected and simply dropped. Any ideas about how this could happen? And is there anything I need to worry about? (The cluster has now recovered and looks good.)
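
In case it helps anyone compare against their own logs, here is a rough Python sketch that tallies these messages by session pair. It only assumes the message format seen in the dmesg lines quoted above; the names `PATTERN`, `summarize` and `SAMPLE` are made up for this example and are not part of any Ceph tool.

```python
import re
from collections import Counter

# Matches the kernel client message format seen in the dmesg output above:
#   ceph: mdsc_handle_reply got <tid> on session mds<A> not mds<B>
# A leading "-" is allowed on the second rank purely as a defensive
# assumption; it does not occur in the quoted logs.
PATTERN = re.compile(
    r"mdsc_handle_reply got (?P<tid>\d+) "
    r"on session mds(?P<got>\d+) not mds(?P<expected>-?\d+)"
)


def summarize(lines):
    """Return (counts, tids): counts maps (arriving mds, expected mds)
    to the number of matching messages, tids lists the request ids."""
    counts = Counter()
    tids = []
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[(int(m["got"]), int(m["expected"]))] += 1
            tids.append(int(m["tid"]))
    return counts, tids


# The seven messages from the log excerpt above.
SAMPLE = """\
Jan 09 01:53:07 gpu027 kernel: ceph: mdsc_handle_reply got 30408991 on session mds1 not mds0
Jan 09 01:53:14 gpu027 kernel: ceph: mdsc_handle_reply got 30409829 on session mds1 not mds0
Jan 09 01:53:15 gpu027 kernel: ceph: mdsc_handle_reply got 30409925 on session mds1 not mds0
Jan 09 01:53:28 gpu027 kernel: ceph: mdsc_handle_reply got 30411416 on session mds1 not mds0
Jan 09 02:05:07 gpu027 kernel: ceph: mdsc_handle_reply got 30417742 on session mds0 not mds1
Jan 09 02:48:52 gpu027 kernel: ceph: mdsc_handle_reply got 30449177 on session mds1 not mds0
Jan 09 02:49:17 gpu027 kernel: ceph: mdsc_handle_reply got 30452750 on session mds1 not mds0
"""

if __name__ == "__main__":
    counts, tids = summarize(SAMPLE.splitlines())
    for (got, expected), n in sorted(counts.items()):
        print(f"reply arrived on mds{got}, request was on mds{expected}: {n}")
    print(f"{len(tids)} dropped replies, tids {min(tids)}..{max(tids)}")
```

On a real client the input could come from saved `dmesg` or journal output instead of `SAMPLE`.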

The clients run Ubuntu 20.04 with kernel 5.11.0-43-generic. The Ceph version is 16.2.7. No active MDS restarted during that time. The standby-replay MDSes did restart, which should be fixed by my PR https://github.com/ceph/ceph/pull/44501 , but I don’t know whether that is related to the issue here.

Regards,
Weiwen Hu