On Mon, May 9, 2016 at 8:48 AM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
> On Mon, 9 May 2016, Nick Fisk wrote:
>> Hi All,
>>
>> I've been testing an active/active Samba cluster over CephFS; performance
>> seems really good with small files compared to Gluster. Soft reboots work
>> beautifully, with little to no interruption in file access. However, when
>> I perform a hard shutdown/reboot of one of the Samba nodes, the remaining
>> node detects that the other Samba node has disappeared but then eventually
>> bans itself. If I leave everything for around 5 minutes, CTDB unbans
>> itself and everything continues running.
>>
>> From what I can work out, the MDS still holds a stale session from the
>> powered-down node, so it won't let the remaining node access the CTDB lock
>> file (which is also sitting on CephFS). CTDB, meanwhile, is hammering away
>> trying to access the lock file, but because something still holds a lock
>> on it, CTDB sees what it thinks is a split-brain scenario and bans itself.
>>
>> I'm guessing the solution is either to reduce the MDS session timeout or
>> to increase the time/retries for CTDB, but I'm not sure which is the
>> better approach. Does anyone have any ideas?
>
> I believe Ira was looking at this exact issue, and addressed it by
> lowering the mds_session_timeout to 30 seconds?

That's the default timeout. I think he lowered the beacon intervals to 5
seconds, plus whatever else flows out from that. We aren't quite sure
whether that's a good idea for real deployments, though!
-Greg
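
For concreteness, the knobs being discussed would live in ceph.conf on the
MDS nodes. A minimal sketch, using Jewel-era option names; the mds_beacon_*
settings are one possible reading of the "beacon intervals" mentioned above,
and none of these values are tested recommendations:

    [mds]
    # Client session timeout the thread is discussing; a client that has
    # not renewed its session within this window is considered stale.
    mds_session_timeout = 30
    # How often the MDS sends its beacon to the monitors, and how long
    # the monitors wait before marking the MDS laggy/failed.
    mds_beacon_interval = 5
    mds_beacon_grace = 15

Whether a stale client session really is the culprit can be checked on the
active MDS via the admin socket, e.g. "ceph daemon mds.<id> session ls".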
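
On the CTDB side, the two pieces involved are the recovery lock (the file
sitting on CephFS) and the ban period that produces the roughly five-minute
recovery described above. A minimal sketch, assuming the pre-4.9 sysconfig
format and a CephFS mount at /mnt/cephfs (path hypothetical):

    # /etc/sysconfig/ctdb (or /etc/default/ctdb on Debian-based systems)
    # The recovery lock lives on the shared CephFS, so both Samba nodes
    # contend for the same file.
    CTDB_RECOVERY_LOCK="/mnt/cephfs/ctdb/.ctdb_recovery_lock"

RecoveryBanPeriod defaults to 300 seconds, which lines up with the observed
five-minute self-unban; it can be adjusted at runtime with
"ctdb setvar RecoveryBanPeriod <seconds>" if a shorter ban turns out to be
the right trade-off.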