> -----Original Message-----
> From: Ira Cooper [mailto:icooper@xxxxxxxxxx]
> Sent: 09 May 2016 17:31
> To: Sage Weil <sage@xxxxxxxxxxxx>
> Cc: Nick Fisk <nick@xxxxxxxxxx>; ceph-users@xxxxxxxxxxxxxx
> Subject: Re: CephFS + CTDB/Samba - MDS session timeout on lockfile
>
> ----- Original Message -----
> > On Mon, 9 May 2016, Nick Fisk wrote:
> > > Hi All,
> > >
> > > I've been testing an active/active Samba cluster over CephFS;
> > > performance seems really good with small files compared to Gluster.
> > > Soft reboots work beautifully with little to no interruption in file
> > > access. However, when I perform a hard shutdown/reboot of one of the
> > > Samba nodes, the remaining node detects that the other Samba node
> > > has disappeared but then eventually bans itself. If I leave
> > > everything for around 5 minutes, CTDB unbans itself and then
> > > everything continues running.
> > >
> > > From what I can work out, it looks like, because the MDS has a stale
> > > session from the powered-down node, it won't let the remaining node
> > > access the CTDB lock file (which is also sitting on the CephFS).
> > > CTDB, meanwhile, is hammering away trying to access the lock file,
> > > but it sees what it thinks is a split-brain scenario because
> > > something still has a lock on the lockfile, and so bans itself.
> > >
> > > I'm guessing the solution is either to reduce the MDS session
> > > timeout or to increase the amount of time/retries for CTDB, but I'm
> > > not sure which is the best approach. Does anyone have any ideas?
> >
> > I believe Ira was looking at this exact issue, and addressed it by
> > lowering the mds_session_timeout to 30 seconds?
>
> Actually...
>
> There's a problem with the way I did it, in that there are issues in
> CephFS that start to come out. Like the fact that it doesn't ban clients
> properly. :(

Could you shed any more light on what these issues might be? I'm assuming
they are around the locking part of CTDB?

> Greg's made comments about this not being production safe; I tend to
> agree. ;)
>
> But it is possible to make the cluster happy. I've been testing on VMs
> with the following added to my ceph.conf for "a while" now.
>
> DISCLAIMER: THESE ARE NOT PRODUCTION SETTINGS! DO NOT USE IN
> PRODUCTION IF YOU LIKE YOUR DATA!
>
> mds_session_timeout = 5
> mds_tick_interval = 1
> mon_tick_interval = 1
> mon_session_timeout = 2
> mds_session_autoclose = 15

These all look like they make Ceph more responsive to the loss of a
client. As per your warning above, what negative effects do you see
potentially arising from them? Or is that more of a warning because they
haven't had long-term testing?

If the problem is only around the CTDB locking to avoid split brain, I
would imagine using CTDB in conjunction with Pacemaker to handle the
fencing would also be a workaround?

> Since I did this, there have been changes made to CTDB to allow an
> external program to be the arbitrator instead of the fcntl lockfile. I'm
> working on an etcd integration for that. Not that it is that
> complicated, but making sure you get the details right is a minor pain.
>
> Also I'll be giving a talk on all of this at SambaXP on Thursday, so if
> you are there, feel free to catch me in the hall. (That goes for anyone
> interested in this topic or ceph/samba topics in general!)

I would be really interested in the slides/video if any are made
available after the event.

> Clearly my being at SambaXP will slow the etcd integration down.
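
No rush on my account. In the meantime I think I can live with a manual
workaround plus a shorter ban window. Roughly what I'm planning to test -
very much a sketch from my notes; the MDS name and client ID below are
placeholders from my lab setup, and I'm going from memory on the session
evict command, so please treat with caution:

  # List client sessions on the MDS admin socket and find the stale one
  # left behind by the powered-off node.
  ceph daemon mds.mds01 session ls

  # Evict that session so the MDS drops the dead client's state (and,
  # if I've understood the failure correctly, the lock it still holds
  # on the CTDB reclock file).
  ceph daemon mds.mds01 session evict <client-id-of-dead-node>

  # Check and shorten how long a CTDB node stays banned after failed
  # recoveries; the default of 300 seconds would explain the roughly
  # 5 minutes I'm seeing before it unbans itself.
  ctdb getvar RecoveryBanPeriod
  ctdb setvar RecoveryBanPeriod 60

To be clear, the tunable only shortens the outage rather than preventing
the ban, so it's a band-aid at best until the external arbitrator work
lands.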
> And I'm betting Greg, John or Sage will want to talk to me about using
> mon instead of etcd ;). Call it a "feeling".
>
> Cheers,
>
> -Ira

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com