On Wed, Aug 17, 2016 at 08:59:35AM -0400, Dan Lambright wrote:
> 
> ----- Original Message -----
> > From: "Niels de Vos" <ndevos@xxxxxxxxxx>
> > To: "Raghavendra G" <raghavendra@xxxxxxxxxxx>
> > Cc: "Dan Lambright" <dlambrig@xxxxxxxxxx>, "Gluster Devel" <gluster-devel@xxxxxxxxxxx>, "Csaba Henk" <csaba.henk@xxxxxxxxx>
> > Sent: Wednesday, August 17, 2016 4:49:41 AM
> > Subject: Re: md-cache improvements
> > 
> > On Wed, Aug 17, 2016 at 11:42:25AM +0530, Raghavendra G wrote:
> > > On Fri, Aug 12, 2016 at 10:29 AM, Raghavendra G <raghavendra@xxxxxxxxxxx> wrote:
> > > > On Thu, Aug 11, 2016 at 9:31 AM, Raghavendra G <raghavendra@xxxxxxxxxxx> wrote:
> > > >> A couple more areas to explore:
> > > >> 
> > > >> 1. Purging the kernel dentry and/or page-cache too. Because of patch [1], an upcall notification can result in a call to inode_invalidate, which results in an "invalidate" notification to the fuse kernel module. While I am sure that this notification will purge the page-cache in the kernel, I am not sure about dentries. I assume that if an inode is invalidated, it should result in a lookup (from the kernel to glusterfs). Nevertheless, we should look into the differences between entry invalidation and inode invalidation and harness them appropriately.
> > 
> > I do not think fuse handles upcall yet. I think there is a patch for
> > that somewhere. It's been a while since I looked into it, but I think
> > invalidating the affected dentries was straightforward.
> 
> Can the patch # be tracked down? I'd like to run some experiments with it + tiering..

I'll attach my attempt here. I don't remember the results of it though.

Niels

> > > >> 2. Granularity of invalidation. For example, we shouldn't purge the kernel page-cache because of a change to an xattr used by an xlator (e.g., the dht layout xattr). We have to make sure that [1] handles this. We need to add more granularity to invalidation (internal xattr invalidation, user xattr invalidation, entry invalidation in the kernel, page-cache invalidation in the kernel, attribute/stat invalidation in the kernel, etc.) and use these judiciously, while making sure other cached data remains present.
> > > > 
> > > > To stress the importance of this point, it should be noted that with tier there can be constant migration of files, which can result in spurious (from the perspective of the application) invalidations, even though the application is not doing any writes to the files [2][3][4]. Also, even if the application is writing to a file, there is no point in invalidating the dentry cache. We should explore more ways to solve [2][3][4].
> > 
> > Actually upcall tracks the client/inode combination, and only sends
> > upcall events to clients that (recently/timeout?) accessed the inode.
> > There should not be any upcalls for inodes that the client did not
> > access. So, when promotion/demotion happens, only the process doing it
> > should receive the event, not any of the other clients that did not
> > access the inode.
> > 
> > > > 3. We have a long-standing issue of spurious termination of the fuse invalidation thread. Since the thread is not re-spawned after termination, we would not be able to purge the kernel entry/attribute/page-cache. This issue was touched upon during a discussion [5], though we didn't solve the problem then for lack of bandwidth. Csaba has agreed to work on this issue.
> > > 
> > > 4. Flooding of the network with upcall notifications. Is it a problem? If yes, does the upcall infra already solve it? Would NFS/SMB leases help here?
> > 
> > I guess some form of flooding is possible when two or more clients do
> > many directory operations in the same directory. Hmm, now I wonder if a
> > client gets an upcall event for something it did itself. I guess that
> > would (most often?) not be needed.
> > 
> > Niels
> > 
> > > > [2] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c7
> > > > [3] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c8
> > > > [4] https://bugzilla.redhat.com/show_bug.cgi?id=1293967#c9
> > > > [5] http://review.gluster.org/#/c/13274/1/xlators/mount/fuse/src/fuse-bridge.c
> > > >> 
> > > >> [1] http://review.gluster.org/12951
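To make point 2 above concrete, here is a rough sketch of what flag-based invalidation granularity could look like on the client side. The INVAL_* bits and the four helper functions are hypothetical, not an existing gluster or upcall API; the point is only the dispatch: each notification says what changed, and only the affected caches get dropped.

#include <stdint.h>

/* Hypothetical granularity bits carried in an upcall payload. */
#define INVAL_INTERNAL_XATTR  0x01  /* e.g. dht layout xattr changed */
#define INVAL_USER_XATTR      0x02  /* a user.* xattr changed        */
#define INVAL_ATTR            0x04  /* stat/iatt changed             */
#define INVAL_PAGE_CACHE      0x08  /* file data changed             */
#define INVAL_ENTRY           0x10  /* dentry no longer valid        */

/* Hypothetical purge helpers; declarations only, for the sketch. */
extern void purge_mdcache_xattrs (uint64_t ino);
extern void purge_mdcache_iatt (uint64_t ino);
extern void notify_kernel_inval_inode (uint64_t ino);
extern void notify_kernel_inval_entry (uint64_t parent, const char *name);

static void
handle_cache_invalidation (uint64_t ino, uint64_t parent,
                           const char *name, uint32_t flags)
{
        /* An internal-xattr change (say, a dht layout update after a
         * tier migration) only needs the md-cache xattr copy dropped;
         * kernel dentries and page-cache stay warm. */
        if (flags & (INVAL_INTERNAL_XATTR | INVAL_USER_XATTR))
                purge_mdcache_xattrs (ino);

        if (flags & INVAL_ATTR)
                purge_mdcache_iatt (ino);

        /* Only data writes need to reach the kernel page-cache... */
        if (flags & INVAL_PAGE_CACHE)
                notify_kernel_inval_inode (ino);

        /* ...and only namespace changes need to drop the dentry; fuse
         * entry invalidation is keyed on <parent, name>, not on the
         * inode itself. */
        if (flags & INVAL_ENTRY)
                notify_kernel_inval_entry (parent, name);
}

With a scheme like this, a layout-xattr update after promotion/demotion would touch only the md-cache copy of the internal xattrs, which is exactly the spurious-invalidation case from [2][3][4].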
> > > >> On Wed, Aug 10, 2016 at 10:35 PM, Dan Lambright <dlambrig@xxxxxxxxxx> wrote:
> > > >>> 
> > > >>> There have been recurring discussions within the gluster community about building on the existing support for md-cache and upcalls to help performance for small-file workloads. In certain cases, "lookup amplification" dominates data transfers, i.e. the cumulative round-trip times of multiple LOOKUPs from the client cancel out the benefits of faster backend storage.
> > > >>> 
> > > >>> To tackle this problem, one suggestion is to utilize md-cache to cache inodes on the client more aggressively than is currently done. The inodes would be cached until they are invalidated by the server.
> > > >>> 
> > > >>> Several gluster development engineers within the DHT, NFS, and Samba teams have been involved with related efforts, which have been underway for some time now. At this juncture, comments are requested from gluster developers.
> > > >>> 
> > > >>> (1) .. help call out where additional upcalls would be needed to invalidate stale client cache entries (in particular, feedback is needed from the DHT/AFR areas),
> > > >>> 
> > > >>> (2) .. identify failure cases, when we cannot trust the contents of md-cache, e.g. when an upcall may have been dropped by the network,
> > > >>> 
> > > >>> (3) .. point out additional improvements which md-cache needs. For example, it cannot be allowed to grow unbounded.
> > > >>> 
> > > >>> Dan
> > > >>> 
> > > >>> ----- Original Message -----
> > > >>> > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>
> > > >>> > 
> > > >>> > List of areas where we need invalidation notification:
> > > >>> > 1. Any changes to xattrs used by xlators to store metadata (like the dht layout xattr, afr xattrs, etc.).
> > > >>> > 2. Scenarios where an individual xlator feels it needs a lookup. For example, a failed directory creation on the non-hashed subvol in dht during mkdir. Though dht succeeds with the mkdir, it would be better not to cache this inode, as a subsequent lookup will heal the directory and make things better.
> > > >>> > 3. Removal of files.
> > > >>> > 4. writev on the brick (to invalidate the read cache on the client).
> > > >>> > 
> > > >>> > Other questions:
> > > >>> > 5. Does md-cache have cache management, like an LRU or an upper limit on the cache?
> > > >>> > 6. Network disconnects and invalidating the cache. When a network disconnect happens, we need to invalidate the cache for inodes present on that brick, as we might have missed some notifications. The current approach of purging the cache of all inodes might not be optimal, as it can roll back the benefits of caching. Also, please note that network disconnects are not rare events.
> > > >>> > 
> > > >>> > regards,
> > > >>> > Raghavendra
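On question 5 and Dan's point (3) above: an invalidation-driven cache cannot be allowed to grow unbounded, so some eviction discipline is needed on top of the current timeout. Below is a minimal sketch of a bounded LRU; the mdc_* names are illustrative only, not the actual md-cache structures, and the head must be TAILQ_INIT'ed and the limit set before use.

#include <stdlib.h>
#include <pthread.h>
#include <sys/queue.h>          /* BSD TAILQ macros (glibc ships them) */

struct mdc_entry {
        TAILQ_ENTRY (mdc_entry) lru;    /* position in the LRU list */
        /* ... cached iatt/xattr data would hang off here ... */
};

TAILQ_HEAD (mdc_lru_head, mdc_entry);

struct mdc_lru {
        struct mdc_lru_head head;       /* head = hottest, tail = coldest */
        size_t              count;
        size_t              limit;
        pthread_mutex_t     lock;
};

/* On every cache hit, move the entry to the hot end. */
static void
mdc_lru_touch (struct mdc_lru *c, struct mdc_entry *e)
{
        pthread_mutex_lock (&c->lock);
        TAILQ_REMOVE (&c->head, e, lru);
        TAILQ_INSERT_HEAD (&c->head, e, lru);
        pthread_mutex_unlock (&c->lock);
}

/* On insert, evict the coldest entry once the limit is crossed and
 * hand it back; the caller owns invalidating/freeing the victim. */
static struct mdc_entry *
mdc_lru_insert (struct mdc_lru *c, struct mdc_entry *e)
{
        struct mdc_entry *victim = NULL;

        pthread_mutex_lock (&c->lock);
        TAILQ_INSERT_HEAD (&c->head, e, lru);
        if (++c->count > c->limit) {
                victim = TAILQ_LAST (&c->head, mdc_lru_head);
                TAILQ_REMOVE (&c->head, victim, lru);
                c->count--;
        }
        pthread_mutex_unlock (&c->lock);
        return victim;
}

Returning the victim instead of destroying it under the lock keeps the eviction policy separate from whatever invalidation/teardown the caller needs to do.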
From a58caf0dd20d7652a4d5cbc4318b93419cd89416 Mon Sep 17 00:00:00 2001
From: Niels de Vos <ndevos@xxxxxxxxxx>
Date: Mon, 7 Dec 2015 17:40:27 +0100
Subject: [PATCH] fuse: basic upcall support for cache-invalidation

Q: Should we make this configurable with a mount option?
Q: Should we increase the timeout (1 sec currently) for inode caching?

Change-Id: Ica521dfd0386f28bd96a126d96d50fdd4a29d49d
Signed-off-by: Niels de Vos <ndevos@xxxxxxxxxx>
---
 xlators/mount/fuse/src/fuse-bridge.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/xlators/mount/fuse/src/fuse-bridge.c b/xlators/mount/fuse/src/fuse-bridge.c
index 5d6979a..548c88a 100644
--- a/xlators/mount/fuse/src/fuse-bridge.c
+++ b/xlators/mount/fuse/src/fuse-bridge.c
@@ -16,6 +16,7 @@
 #include "compat-errno.h"
 #include "glusterfs-acl.h"
 #include "syscall.h"
+#include "upcall-utils.h"
 
 #ifdef __NetBSD__
 #undef open /* in perfuse.h, pulled from mount-gluster-compat.h */
@@ -5203,6 +5204,24 @@ notify (xlator_t *this, int32_t event, void *data, ...)
                 fini (this);
                 break;
         }
+#if FUSE_KERNEL_MINOR_VERSION >= 11
+        case GF_EVENT_UPCALL:
+        {
+                /* At the moment we only care for cache-invalidation */
+                struct gf_upcall *ue = data;
+
+                if (ue->event_type == GF_UPCALL_CACHE_INVALIDATION) {
+                        struct gf_upcall_cache_invalidation *uci = ue->data;
+                        uint64_t fuse_ino = uci->stat.ia_ino;
+
+                        if (private->enable_ino32)
+                                fuse_ino = GF_FUSE_SQUASH_INO (fuse_ino);
+
+                        fuse_invalidate_entry (this, fuse_ino);
+                }
+                break;
+        }
+#endif
 
         default:
                 break;
-- 
2.7.4
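A note on the patch: it calls fuse_invalidate_entry(), and the entry-vs-inode distinction from point 1 of the thread is visible at the /dev/fuse protocol level, where the two are distinct reverse notifications. Below is a self-contained sketch of both, written directly against the kernel ABI in <linux/fuse.h>; this is essentially what libfuse's fuse_lowlevel_notify_inval_inode() and fuse_lowlevel_notify_inval_entry() do. Assumptions: fuse_fd is the already-open /dev/fuse descriptor, and error handling is omitted. Since the upcall payload above carries only the inode, an entry invalidation still has to derive the <parent, name> pairs from somewhere (if memory serves, gluster's fuse_invalidate_entry() walks the inode's dentry list for this).

#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/fuse.h>

/* Reverse notifications are written to /dev/fuse with unique == 0;
 * the "error" field of the header carries the notification code. */
static int
send_notify (int fuse_fd, int code, struct iovec *iov, int count)
{
        struct fuse_out_header hdr = { .error = code, .unique = 0 };
        int i;

        hdr.len = sizeof (hdr);
        for (i = 1; i < count; i++)
                hdr.len += iov[i].iov_len;

        iov[0].iov_base = &hdr;
        iov[0].iov_len = sizeof (hdr);

        return (writev (fuse_fd, iov, count) < 0) ? -1 : 0;
}

/* Drop cached attributes and a page-cache range of an inode. A
 * negative off invalidates attributes only; len == 0 means "from
 * off to the end of the file". */
static int
notify_inval_inode (int fuse_fd, uint64_t ino, int64_t off, int64_t len)
{
        struct fuse_notify_inval_inode_out out = {
                .ino = ino, .off = off, .len = len,
        };
        struct iovec iov[2] = {
                { 0 }, { &out, sizeof (out) },
        };

        return send_notify (fuse_fd, FUSE_NOTIFY_INVAL_INODE, iov, 2);
}

/* Drop the dentry <parent, name>; the next access forces a fresh
 * LOOKUP from the kernel down to glusterfs. */
static int
notify_inval_entry (int fuse_fd, uint64_t parent, const char *name)
{
        struct fuse_notify_inval_entry_out out = {
                .parent  = parent,
                .namelen = strlen (name),
        };
        struct iovec iov[3] = {
                { 0 },
                { &out, sizeof (out) },
                { (void *) name, out.namelen + 1 },  /* include the NUL */
        };

        return send_notify (fuse_fd, FUSE_NOTIFY_INVAL_ENTRY, iov, 3);
}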
_______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel