Re: CephFS + CTDB/Samba - MDS session timeout on lockfile

Eric Eastman <eric.eastman@xxxxxxxxxxxxxx> · Wed, 11 May 2016 09:02:14 -0600

On Wed, May 11, 2016 at 2:04 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> -----Original Message-----
>> From: Eric Eastman [mailto:eric.eastman@xxxxxxxxxxxxxx]
>> Sent: 10 May 2016 18:29
>> To: Nick Fisk <nick@xxxxxxxxxx>
>> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
>> Subject: Re:  CephFS + CTDB/Samba - MDS session timeout on
>> lockfile
>>
>> On Tue, May 10, 2016 at 6:48 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >
>> >
>> >> -----Original Message-----
>> >> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf
>> >> Of Nick Fisk
>> >> Sent: 10 May 2016 13:30
>> >> To: 'Eric Eastman' <eric.eastman@xxxxxxxxxxxxxx>
>> >> Cc: 'Ceph Users' <ceph-users@xxxxxxxxxxxxxx>
>> >> Subject: Re:  CephFS + CTDB/Samba - MDS session timeout
>> >> on lockfile
>>
>> >> > On Mon, May 9, 2016 at 3:28 PM, Nick Fisk <nick@xxxxxxxxxx> wrote:
>> >> > > Hi Eric,
>> >> > >
>> >> > >>
>> >> > >> I am trying to do some similar testing with SAMBA and CTDB with
>> >> > >> the Ceph file system.  Are you using the vfs_ceph SAMBA module
>> >> > >> or are you kernel mounting the Ceph file system?
>> >> > >
>> >> > > I'm using the kernel client. I couldn't find any up-to-date
>> >> > > information on whether the vfs plugin supported all the necessary
>> >> > > bits and pieces.
>> >> > >
>> >> > > How is your testing coming along? I would be very interested in
>> >> > > any findings you may have come across.
>> >> > >
>> >> > > Nick
>> >> >
>> >> > I am also using CephFS kernel mounts, with 4 SAMBA gateways. When
>> >> > from a SAMBA client, I write a large file (about 2GB) to a gateway
>> >> > that is not the holder of the CTDB lock file, and then kill that
>> >> > gateway server during the write, the IP failover works as expected,
>> >> > and in most cases the file ends up being the correct size after the
>> >> > new server finishes writing it, but the data is corrupt. The data in
>> >> > the file, from the point of the failover, is all zeros.
>> >> >
>> >> > I thought the issue may be with the kernel mount, so I looked into
>> >> > using the SAMBA vfs_ceph module, but I need SAMBA with AD support
>> >> > and the current vfs_ceph module, even in the SAMBA git master
>> >> > version, is lacking ACL support for CephFS, as the vfs_ceph.c
>> >> > patches submitted to the SAMBA mailing list are not yet available.
>> >> > See:
>> >> > https://lists.samba.org/archive/samba-technical/2016-March/113063.html
>> >> >
>> >> > I tried using a FUSE mount of the CephFS, and it also fails setting
>> >> > ACLs.  See:
>> >> > http://tracker.ceph.com/issues/15783.
>> >> >
>> >> > My current status is IP failover is working, but I am seeing data
>> >> > corruption on writes to the share when using kernel mounts. I am
>> >> > also seeing the issue you reported when I kill the system holding
>> >> > the CTDB lock file.  Are you verifying your data after each failover?
>> >>
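A minimal sketch of the kind of post-failover check asked about above, assuming a known-good copy of the source file is available for comparison. The function names and chunk size are illustrative and not part of CTDB or SAMBA. It hashes a file for comparison and reports where a trailing run of zero bytes begins, which is the symptom described (zeros from the point of the failover).

```python
import hashlib

CHUNK = 1024 * 1024  # read in 1 MiB pieces so multi-GB files are not loaded whole


def sha256_of(path, chunk=CHUNK):
    """Return the hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()


def zero_tail_start(path, chunk=CHUNK):
    """Return the offset where a trailing run of zero bytes begins,
    or None if the file does not end in zero bytes."""
    start = None
    offset = 0
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk), b""):
            stripped = block.rstrip(b"\0")
            if len(stripped) == len(block):
                start = None                     # block ends in real data
            elif stripped:
                start = offset + len(stripped)   # zeros begin inside this block
            elif start is None:
                start = offset                   # all-zero block opens a new run
            offset += len(block)
    return start
```

Comparing `sha256_of()` for the original and for the copy on the share shows whether the write survived the failover. If the digests differ, `zero_tail_start()` on the copy gives the offset to correlate with the time the gateway was killed. A file that legitimately ends in zero bytes will also report an offset, so the hash comparison is the authoritative check.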
>> >> I must admit you are slightly ahead of me. I was initially trying to
>> >> just get hard/soft failover working correctly. But your response has
>> >> prompted me to test out the scenario you mentioned. I'm seeing
>> >> slightly different results: my copy seems to error out when I do a
>> >> node failover. I'm copying an ISO from a 2008 server to the
>> >> CTDB/Samba share and when I reboot the active node, the copy pauses
>> >> for a couple of seconds and then comes up with the error box.
>> >> Clicking try again several times doesn't let it resume. I need to do
>> >> a bit more digging to try and work out why this is happening. The
>> >> share itself does seem to be in a working state when trying to click
>> >> the try again button, so there is probably some sort of state/session
>> >> problem.
>> >>
>> >> Do you have multiple VIPs configured on your cluster or just a single
>> >> IP? I have just the one at the moment.
>>
>> I have 4 HA addresses set up, and I am using my AD to do the round-robin
>> DNS. Moving IP addresses on failure, or when a CTDB-controlled
>> SAMBA system comes online, works great.
>
> I've just added another VIP to the cluster so I will see if this changes anything.
>
>>
>> >
>> > Just to add to this, I have just been reading this article
>> >
>> > https://nnc3.com/mags/LM10/Magazine/Archive/2009/105/030-035_SambaHA/article.html
>> >
>> > And the following paragraph seems to indicate that what I am seeing is
>> > the correct behaviour? I'm wondering if this is not happening in your
>> > case and is why you are getting corruption?
>> >
>> > "It is important to understand that load balancing and client
>> > distribution over the client nodes are connection oriented. If an IP
>> > address is switched from one node to another, all the connections
>> > actively using this IP address are dropped and the clients have to
>> > reconnect.
>> >
>> > To avoid delays, CTDB uses a trick: When an IP is switched, the new
>> > CTDB node "tickles" the client with an illegal TCP ACK packet (tickle
>> > ACK) containing an invalid sequence number of 0 and an ACK number of
>> > 0. The client responds with a valid ACK packet, allowing the new IP
>> > address owner to close the connection with an RST packet, thus forcing
>> > the client to reestablish the connection to the new node."
>> >
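To make the quoted "tickle ACK" description concrete, here is a rough sketch of what such a TCP header looks like on the wire. It only packs the 20-byte header with the fields the article names (sequence number 0, ACK number 0, ACK flag set). It is not CTDB's code, it does not send anything, the checksum is left as a placeholder, and the client port in the usage note is a made-up value.

```python
import struct

TCP_FLAG_ACK = 0x10  # ACK bit in the TCP flags field


def build_tickle_ack(src_port, dst_port, window=0):
    """Pack a bare 20-byte TCP header with seq=0, ack=0 and only ACK set.

    Illustration only: the checksum is left at 0 because a real one needs
    the IP pseudo-header, and the window value is an arbitrary assumption.
    """
    seq = 0                      # deliberately invalid sequence number
    ack = 0                      # deliberately invalid acknowledgement number
    data_offset_words = 5        # 5 x 32-bit words = 20 bytes, no options
    offset_and_flags = (data_offset_words << 12) | TCP_FLAG_ACK
    checksum = 0                 # placeholder, see docstring
    urgent = 0
    return struct.pack(
        "!HHIIHHHH",
        src_port, dst_port, seq, ack,
        offset_and_flags, window, checksum, urgent,
    )
```

For example, `build_tickle_ack(445, 50000)` gives the header a new IP owner might send from the SMB port (445) to a client's ephemeral port. Per the article, the client answers with an ACK carrying its real sequence numbers, which lets the new node reset the stale connection.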
>>
>> Nice article.  I have been trying to figure out if data integrity is
>> supported with CTDB on failover on any shared file system.  From looking
>> at various email posts on CTDB+GPFS, it looks like it may work, so I am
>> going to continue to test it with various CephFS configurations.  There
>> is a new "witness protocol" in SMB3 to support failover that is not yet
>> supported in any released version of SAMBA. I may have to wait for it to
>> be implemented in SAMBA to get fully working failover. See:
>>
>> https://wiki.samba.org/index.php/Samba3/SMB2#Witness_Notification_Protocol
>> https://sambaxp.org/archive_data/SambaXP2015-SLIDES/wed/track1/sambaxp2015-wed-track1-Guenther_Deschner-ImplementingTheWitnessProtocolInSamba.pdf
>
> Yes I saw that as well, looks really good and would certainly make the whole solution very smooth. I tested the settings Ira posted to lower the MDS session timeout and can confirm that I can now hard kill a CTDB node without the others getting banned. I plan to do some more testing around this, but I would really like to hear from Ira what his concerns around the settings were.
>
> I.e.:
> 1. Just untested, probably ok, but I'm not putting my name on it
> 2. Yeah I saw a big dragon fly out of nowhere and eat all my data

Thank you for the info on your testing with the lower MDS session
timeouts.  I will add those settings to my test system.

> Have you done any testing with CephFS snapshots? I was having a go at getting them working with "Previous Versions" yesterday, which worked OK, but the warning on the CephFS page is a bit off-putting.

I have been testing snapshots off the root directory since Hammer.
Over the last year I have found a few bugs, filed tracker issues,
and they were quickly fixed. My default build scripts for my test
clusters set up a cron job to take hourly snapshots, and I test the
snapshots from time to time.  All my SAMBA testing is being done with
snapshots on.  With Jewel, snapshots seem to be working well.  I do
not run snapshots on lower directories as I don't need that
functionality right now, and there are more warnings from the Ceph
engineers on using that additional functionality.
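For anyone wanting to try the same thing, an hourly snapshot job could look roughly like the sketch below. It relies on the fact that a CephFS snapshot is created by making a directory under the hidden .snap directory. The snapshot naming scheme and the function names are assumptions for illustration and this is not the actual script from my build, and it assumes snapshots are already enabled on the file system.

```python
import os
from datetime import datetime, timezone


def snapshot_path(root, when):
    """Build the .snap path for a snapshot of root taken at time when.

    The "hourly-YYYYMMDD-HHMM" name is an arbitrary choice for this sketch.
    """
    name = when.strftime("hourly-%Y%m%d-%H%M")
    return os.path.join(root, ".snap", name)


def take_snapshot(root, when=None):
    """Create a snapshot of root by making a directory under root/.snap.

    On a mounted CephFS, .snap is a virtual directory and the mkdir is what
    triggers the snapshot. Raises FileExistsError if the name is taken.
    """
    when = when or datetime.now(timezone.utc)
    path = snapshot_path(root, when)
    os.mkdir(path)
    return path

# Example call from a script run hourly by cron; "/mnt/cephfs" is a
# hypothetical mount point:
#     take_snapshot("/mnt/cephfs")
```

Removing a snapshot is the reverse operation, an rmdir of the same directory under .snap, so a matching cleanup job would be needed to keep the number of hourly snapshots bounded.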

Eric