On Mon, May 9, 2016 at 8:48 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 9 May 2016, Nick Fisk wrote:
>> Hi All,
>>
>> I've been testing an active/active Samba cluster over CephFS; performance
>> seems really good with small files compared to Gluster. Soft reboots work
>> beautifully, with little to no interruption in file access. However, when
>> I perform a hard shutdown/reboot of one of the Samba nodes, the remaining
>> node detects that the other Samba node has disappeared but then eventually
>> bans itself. If I leave everything for around 5 minutes, CTDB unbans
>> itself and everything continues running.
>>
>> From what I can work out, the MDS still holds a stale session from the
>> powered-down node, so it won't let the remaining node access the CTDB lock
>> file (which is also sitting on CephFS). CTDB, meanwhile, is hammering away
>> trying to access the lock file, but because something still holds a lock
>> on it, CTDB sees what it thinks is a split-brain scenario and bans itself.
>>
>> I'm guessing the solution is either to reduce the MDS session timeout or
>> to increase the time/retries for CTDB, but I'm not sure which is the
>> better approach. Does anyone have any ideas?
>
> I believe Ira was looking at this exact issue, and addressed it by
> lowering the mds_session_timeout to 30 seconds?

That's the default timeout. I think he lowered the beacon intervals to 5
seconds, plus whatever else flows out from that. We aren't quite sure
whether that's a good idea for real deployments, though!
-Greg
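
For concreteness, the knobs being discussed would live in ceph.conf on the
MDS nodes. A minimal sketch, using Jewel-era option names; the mds_beacon_*
settings are one possible reading of the "beacon intervals" mentioned above,
and none of these values are tested recommendations:

    [mds]
    # Client session timeout the thread is discussing; a client that has
    # not renewed its session within this window is considered stale.
    mds_session_timeout = 30
    # How often the MDS sends its beacon to the monitors, and how long
    # the monitors wait before marking the MDS laggy/failed.
    mds_beacon_interval = 5
    mds_beacon_grace = 15

Whether a stale client session really is the culprit can be checked on the
active MDS via the admin socket, e.g. "ceph daemon mds.<id> session ls".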
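
On the CTDB side, the two pieces involved are the recovery lock (the file
sitting on CephFS) and the ban period that produces the roughly five-minute
recovery described above. A minimal sketch, assuming the pre-4.9 sysconfig
format and a CephFS mount at /mnt/cephfs (path hypothetical):

    # /etc/sysconfig/ctdb (or /etc/default/ctdb on Debian-based systems)
    # The recovery lock lives on the shared CephFS, so both Samba nodes
    # contend for the same file.
    CTDB_RECOVERY_LOCK="/mnt/cephfs/ctdb/.ctdb_recovery_lock"

RecoveryBanPeriod defaults to 300 seconds, which lines up with the observed
five-minute self-unban; it can be adjusted at runtime with
"ctdb setvar RecoveryBanPeriod <seconds>" if a shorter ban turns out to be
the right trade-off.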