Re: multi-node NFS Ganesha + libcephfs caching

Daniel Gryniewicz <dang@xxxxxxxxxx> · Tue, 24 Mar 2020 13:38:41 -0400

On 3/24/20 1:16 PM, Maged Mokhtar wrote:

On 24/03/2020 16:48, Maged Mokhtar wrote:

On 24/03/2020 15:14, Daniel Gryniewicz wrote:

On 3/24/20 8:19 AM, Maged Mokhtar wrote:

On 24/03/2020 13:35, Daniel Gryniewicz wrote:

On 3/23/20 4:31 PM, Maged Mokhtar wrote:

On 23/03/2020 20:50, Jeff Layton wrote:
On Mon, 2020-03-23 at 15:49 +0200, Maged Mokhtar wrote:
Hello all,

For multi-node NFS Ganesha over CephFS, is it OK to leave
libcephfs write caching on, or should it be configured off for
failover?

You can do libcephfs write caching, as the caps would need to be
recalled for any competing access. What you really want to avoid is any
sort of caching at the ganesha daemon layer.

Hi Jeff,

Thanks for your reply. I meant caching by libcephfs as used within the
ganesha Ceph FSAL plugin. I am not sure from your reply whether this is
what you refer to as the ganesha daemon layer (or does the latter mean
the internal mdcache in ganesha?). I would really appreciate it if you
could clarify this point.

Caching in libcephfs is fine, it's caching above the FSAL layer 
that you should avoid.

I really have doubts that it is safe to leave write caching on in the
plugin and still have safe failover, yet I see comments in the conf file
such as:
# The libcephfs client will aggressively cache information while it
# can, so there is little benefit to ganesha actively caching the same
# objects.

Or is it up to the NFS client to issue cache syncs and re-submit
writes if it detects failover?

Correct. During failover, NFS will go into its grace period,
which blocks new state and allows the NFS clients to re-acquire
their state (opens, locks, delegations, etc.). This includes
re-sending any non-committed writes (commits will cause the data to
be saved to the cluster, not just the libcephfs cache). Once this
is all done, normal operation proceeds. It should be safe, even
with caching in libcephfs.
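
The re-send of non-committed writes described above can be sketched as a toy model. This is only an illustration under stated assumptions: ToyServer and ToyClient are hypothetical names, not Ganesha or kernel code, and the "verifier" stands in for the write verifier that an NFS server returns so a client can notice a restart. State reclaim (opens, locks, delegations) is left out.

```python
import itertools


class ToyServer:
    """Hypothetical stand-in for one NFS server instance."""

    _boot_ids = itertools.count(1)

    def __init__(self, stable):
        self.stable = stable      # shared backing store, survives a crash
        self.cache = {}           # volatile cache, lost on crash/failover
        self.verifier = next(ToyServer._boot_ids)  # new value per instance

    def write_unstable(self, offset, data):
        # Data only reaches the volatile cache; the verifier is returned.
        self.cache[offset] = data
        return self.verifier

    def commit(self):
        # Push cached data to the backing store and return the verifier.
        self.stable.update(self.cache)
        self.cache.clear()
        return self.verifier


class ToyClient:
    """Hypothetical client that keeps writes until they are committed."""

    def __init__(self, server):
        self.server = server
        self.uncommitted = {}     # offset -> (data, verifier seen at write)

    def write(self, offset, data):
        verf = self.server.write_unstable(offset, data)
        self.uncommitted[offset] = (data, verf)

    def commit(self):
        while self.uncommitted:
            verf = self.server.commit()
            # A verifier mismatch means the server instance changed after
            # the write was sent, so that write may have been lost.
            stale = {off: d for off, (d, v) in self.uncommitted.items()
                     if v != verf}
            self.uncommitted = {}
            for off, d in stale.items():
                self.write(off, d)    # re-send, then commit again


stable = {}
client = ToyClient(ToyServer(stable))
client.write(0, b"hello")
client.write(5, b"world")
client.server = ToyServer(stable)   # failover: empty cache, new verifier
client.commit()
```

In this model the first commit after the failover reports a different verifier, so the client re-sends both writes and commits again before dropping its copies.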

Daniel

Thanks, Daniel, for the clarification. So it is the responsibility of
the client to re-send writes. Two questions so I can understand this
better:

- If this is handled at the client, why is it OK on the gateway to
cache at the FSAL layer but not above?

In principle, it's fine above.  However, that requires a level of 
coordination that's not there right now.  The libcephfs cache is 
integrated with the CAPs system, and knows when it can cache and when 
it needs to flush.  There's work to do to get that up to the higher 
layers.
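
As a rough picture of what "knows when it can cache and when it needs to flush" means, here is a toy model. It is a sketch under heavy assumptions: ToyMDS and ToyCephClient are hypothetical names, a single capability covers everything, and none of this is the libcephfs API. Real caps are tracked per inode and come in several kinds.

```python
class ToyMDS:
    """Hypothetical coordinator that hands out one buffering capability."""

    def __init__(self):
        self.backing = {}     # stand-in for data stored in the cluster
        self.holder = None    # client currently allowed to buffer

    def acquire(self, client):
        # Competing access: recall the cap from the current holder first.
        if self.holder is not None and self.holder is not client:
            self.holder.recall()
        self.holder = client


class ToyCephClient:
    """Hypothetical client-side cache that obeys cap recalls."""

    def __init__(self, mds):
        self.mds = mds
        self.dirty = {}       # writes buffered locally, not yet flushed

    def write(self, key, data):
        self.mds.acquire(self)
        self.dirty[key] = data

    def read(self, key):
        self.mds.acquire(self)
        return self.dirty.get(key, self.mds.backing.get(key))

    def recall(self):
        # Flush buffered writes before giving up the capability.
        self.mds.backing.update(self.dirty)
        self.dirty.clear()


mds = ToyMDS()
node_a = ToyCephClient(mds)
node_b = ToyCephClient(mds)

node_a.write("file", b"v1")
buffered_only = "file" not in mds.backing   # still only in node_a's cache
seen_by_b = node_b.read("file")             # triggers recall and flush
```

A cache above the FSAL has no equivalent of recall() in this picture, which is the missing coordination mentioned above.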

- At what level/layer on the client does this get handled: the NFS client
layer (which will detect failover), the filesystem layer, the page cache...?

The NFS client layer, interacting with the VFS/page cache.  (NFS is 
the filesystem in this case, so technically the filesystem layer.)

Daniel

Thank you so much for the clarification.

Maged
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

One more thing: for non-Linux clients, specifically VMware, their NFS
client may not behave the same, correct? In the iSCSI domain, VMware
does not have any kind of buffer/page cache, which is probably to
support failover among ESXi nodes. Should I test this, or am I on the
wrong track? /Maged

This behavior is a requirement of the spec.  All compliant NFS 
implementations behave this way.  If you don't have a client side cache, 
then you have to do only stable writes (each write is sync'd to the 
backing store). This is slower, but it's safe. If VMware doesn't do
this, then they *will* lose data if the server ever crashes, and it will
be their exclusive fault.
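
From an application's point of view, the two safe patterns look roughly like the sketch below. The helper names are made up for illustration. The assumption, which is worth verifying on your own client, is that on an NFS mount O_SYNC makes each write() return only once the data is on stable storage, while fsync() makes the client flush its cached writes and commit them.

```python
import os
import tempfile


def write_stable(path, data):
    """Every write is synchronous (no reliance on a client-side cache)."""
    # O_SYNC is missing on some platforms; fall back to 0 there.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_SYNC", 0)
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, data)    # small payload; a full loop would handle short writes
    finally:
        os.close(fd)


def write_then_commit(path, data):
    """Writes may sit in the client cache until the explicit fsync()."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, data)
        os.fsync(fd)          # data is only considered safe after this returns
    finally:
        os.close(fd)


def read_back(path):
    with open(path, "rb") as f:
        return f.read()
```

The first pattern is the slower but cache-free one; the second depends on the client keeping the data until the commit succeeds.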

Daniel