Hi,
CephFS clients are blacklisted if they do not respond to heartbeat messages in time. The MDS will then deny the reconnect:
[ 1815.029831] ceph: mds0 closed our session
[ 1815.029833] ceph: mds0 reconnect start
[ 1815.052219] ceph: mds0 reconnect denied
[ 1815.052229] ceph: dropping dirty Fw state for ffff9d9085da1340 1099512175611
[ 1815.052231] ceph: dropping dirty+flushing Fw state for ffff9d9085da1340 1099512175611
[ 1815.273008] libceph: mds0 10.99.10.4:6801 socket closed (con state NEGOTIATING)
[ 1816.033241] ceph: mds0 rejected session
[ 1829.018643] ceph: mds0 hung
[ 1880.088504] ceph: mds0 came back
[ 1880.088662] ceph: mds0 caps renewed
[ 1880.094018] ceph: get_quota_realm: ino (10000000afe.fffffffffffffffe) null i_snap_realm
[ 1881.100367] ceph: get_quota_realm: ino (10000000afe.fffffffffffffffe) null i_snap_realm
[ 2046.768969] conntrack: generic helper won't handle protocol 47. Please consider loading the specific helper module.
[ 2061.731126] ceph: get_quota_realm: ino (10000000afe.fffffffffffffffe) null i_snap_realm
This renders the mount useless until a complete remount is done. You can verify this by printing the OSD blocklist with the 'ceph osd blocklist ls' command once the mount point has become unusable.
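For reference, a rough recovery sequence could look like the one below. This is only a sketch: the 'blocklist' spelling applies to recent releases (older ones use 'blacklist'), and the client address, mount point and mount options are placeholders for your own setup.

  # list blocklisted client addresses
  $ ceph osd blocklist ls

  # optionally remove an entry before the blocklist period expires
  $ ceph osd blocklist rm <client_addr:port/nonce>

  # on the affected client a full remount is still required
  $ umount -f /mnt/cephfs
  $ mount -t ceph <mon_host>:6789:/ /mnt/cephfs -o name=admin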
The intention of this behavior is to handle rogue or faulty clients. If your client currently holds the caps for important directories and the machine suffers a hardware error (and won't come back soon), access to those directories would otherwise stay blocked: other clients would not be able to use them until the broken machine came back. A network outage is another example.
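For completeness: the same eviction can also be triggered manually, which is useful when a client machine is known to be dead and you don't want to wait for the timeout. A sketch, with the MDS rank and client id as placeholders:

  # list current client sessions (including the number of caps they hold)
  $ ceph tell mds.0 client ls

  # evict (and thereby blocklist) a specific client
  $ ceph tell mds.0 client evict id=<client_id>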
You can configure the MDS session timeout that triggers the blacklisting. But keep in mind that simply using a longer timeout may lead to other problems (e.g. longer stalls for the remaining clients) in case of real errors.
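As a sketch (the file system name 'cephfs' and the 120 second value are only examples; on older releases the timeout may have to be adjusted in the MDS configuration instead):

  # show the current session timeout of the file system
  $ ceph fs get cephfs | grep session_timeout

  # raise it, e.g. to 120 seconds
  $ ceph fs set cephfs session_timeout 120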
Regards,
Burkhard