Allowing cephfs clients to reconnect

Mikael Öhman <micketeer@xxxxxxxxx> · Wed, 13 Nov 2019 15:23:25 -0000

Hi, I'm trying to make our system a bit more fault tolerant, and I struggle a bit with letting clients reconnect if they have lost contact for a while.
When there is a temporary network problem, I would like clients to block I/O, wait for a connection, and resume.
Do I have any options other than just increasing mds_session_autoclose?
Is there a downside to using a very large value here (say, a full day)? I expect all clients to be connected at all times anyway when things are running normally.
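
For concreteness, here is a minimal sketch of how a day-long timeout could be set. It assumes a release that has the centralized "ceph config" interface; the helper name autoclose_command is made up for illustration. The script only prints the command so it can be reviewed before anything is run against the cluster.

```python
import shlex
from typing import List

ONE_DAY = 24 * 60 * 60  # a full day, in seconds


def autoclose_command(seconds: int) -> List[str]:
    """Build the argv that sets mds_session_autoclose for all MDS daemons.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return ["ceph", "config", "set", "mds", "mds_session_autoclose", str(seconds)]


if __name__ == "__main__":
    # Print rather than execute, so the command can be checked first.
    print(shlex.join(autoclose_command(ONE_DAY)))
```

Whether a value this large has side effects on the MDS is exactly the open question above, so this is not a recommendation.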

What I see right now (if the disconnect is sufficiently long) is that the Ceph client stops blocking I/O, and every I/O operation on the existing mount point fails with permission denied.
Remounting works, but it also requires killing all active sessions that block the unmount. Overall this is bad when it happens, and I would prefer almost any other option.

I can see that the client tries a reconnect when this happens:
Nov 12 11:53:24 hebbe01-3 kernel: libceph: mds0 10.43.20.3:6800 connection reset
Nov 12 11:53:24 hebbe01-3 kernel: libceph: reset on mds0
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 closed our session
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect start
Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect denied
Nov 12 11:56:55 hebbe01-3 kernel: libceph: mds0 10.43.20.3:6800 socket closed (con state NEGOTIATING)
Nov 12 11:56:55 hebbe01-3 kernel: ceph: mds0 rejected session
but, according to its logs, the MDS server disallows it because the MDS is not in a "reconnect state".
So, if I understand this correctly, reconnecting is only available when the MDS server has been restarted?
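
To spot this condition on many clients, the kernel messages quoted above can be scanned for refused sessions. This is a rough sketch that matches only the two message texts shown in the log excerpt; the function name denied_reconnects and the SAMPLE list are illustrative, and other kernel versions may word these messages differently.

```python
import re
from typing import Iterable, List, Tuple

# Matches kernel lines such as "... kernel: ceph: mds0 reconnect denied".
DENIED = re.compile(r"kernel: ceph: mds(\d+) (reconnect denied|rejected session)")


def denied_reconnects(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (MDS number, event) for each line that shows a refused session."""
    hits = []
    for line in lines:
        match = DENIED.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits


# Lines copied from the log excerpt above.
SAMPLE = [
    "Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect start",
    "Nov 12 11:53:24 hebbe01-3 kernel: ceph: mds0 reconnect denied",
    "Nov 12 11:56:55 hebbe01-3 kernel: ceph: mds0 rejected session",
]

if __name__ == "__main__":
    for mds, event in denied_reconnects(SAMPLE):
        print(f"mds{mds}: {event}")
```

Feeding it the syslog or journal output of each client would show which hosts need a remount.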

Best regards, Mikael
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx