----- Original Message ----- > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx> > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, Gluster-devel@xxxxxxxxxxx > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, "Cynthia Zhou (Nokia - CN/Hangzhou)" > <cynthia.zhou@xxxxxxxxx>, "Tero Venetjoki (Nokia - FI/Espoo)" <tero.venetjoki@xxxxxxxxx> > Sent: Monday, March 20, 2017 8:44:30 AM > Subject: Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel > > > Hi, GlusterFS expert, > > In our latest test, We found an issue potentially related to glusterfs. When > I execute “gluster volume set <volume> performance.write-behind on/off”, > some bind mounts will get lost, this issue is permanent! > Test steps: > a) mkdir -p /mnt/log/node1/test1/test2; mkdir -p /mnt/log/node2/test1/test2 > b) mkdir -p /mnt/test1/test2 > c) mount –bind /mnt/log/node1/test1 /mnt/test1 > d) mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2 > e) mount | cut -d " " -f 3|xargs stat > f) gluster volume set log performance.write-behind on/off check > /mnt/test1/test2 , could found this bind mount get lost > > We’ve consulted the linux kernel side guys, their explanations for this “bind > mount lost” issue is when kernel side do lookup or stat(), it will find the > nodeid has changed and trigger fuse_dentry_revalidate, and > fuse_dentry_relivalidate() fails for some dentry and invalidates it which > leads to unmounting. > > 88.912262 | 0) stat-437 | | > fuse_dentry_revalidate [fuse]() { > <...> > 88.912264 | 0) stat-437 | | > fuse_simple_request [fuse]() { > <...> > 88.921383 | 0) stat-437 | # 9119.255 us | } /* > fuse_simple_request [fuse] */ > 88.921383 | 0) stat-437 | 0.093 us | dput(); > 88.921384 | 0) stat-437 | | > fuse_queue_forget [fuse]() { > <...> > 88.921427 | 0) stat-437 | + 42.737 us | } > 88.921427 | 0) stat-437 | # 9164.967 us | } /* > fuse_dentry_revalidate [fuse] */ > 88.921427 | 0) stat-437 | | > d_invalidate() { > 88.921427 | 0) stat-437 | 0.048 us | > _raw_spin_lock(); > 88.921428 | 0) stat-437 | | > d_walk() { > 88.921428 | 0) stat-437 | 0.040 us | > _raw_spin_lock(); > 88.921428 | 0) stat-437 | 0.046 us | > detach_and_collect(); > 88.921429 | 0) stat-437 | 0.723 us | } > 88.921429 | 0) stat-437 | | > __detach_mounts() { > > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the > failure is clearly caused by (outarg.nodeid != get_node_id(inode)) check: > > 7078187a795f8 (Miklos Szeredi 2014-12-12 09:49:05 +0100 230) > if (!ret) { > 6314efee3cfee (Miklos Szeredi 2013-10-01 16:41:22 +0200 231) > fi = get_fuse_inode(inode); > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 232) > if (outarg.nodeid != get_node_id(inode)) { > 07e77dca8a1f1 (Miklos Szeredi 2010-12-07 20:16:56 +0100 233) > fuse_queue_forget(fc, forget, outarg.nodeid, > 1); > e2a6b95236eba (Miklos Szeredi 2013-09-05 11:44:43 +0200 234) > goto invalid; > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 235) > } > > The latest kernel is doing the following in fs/fuse/dir.c: > > if (outarg.nodeid != get_node_id(inode)) { > fuse_queue_forget(fc, forget, outarg.nodeid, 1); > goto invalid; > } > > invalid: > ret = 0; > goto out; > } > > When We check the glusterfs code, I find that when execute command “gluster > volume set <volume> performance.write-behind on/off”, the glusterfs client > side graph will be re-built. And all xlators will be re-initialized , > So I am wondering if this will lead to node id change and cause kernel side > do the umount, because I find in function fuse_graph_setup, inode_table_new > is called, does it mean all inode will be changed? Yes nodeids change after graph switch. Nice catch btw :). > If so, we need consider the following solutions > 1) is there some way to notify kernel side in graceful way so that kernel > side can do update the nodeid which stored before? AFAIK, there is no such way. > 2) is there some way let itable only initialized once, that's mean, we only > initialized while fuse_graph_setup is first called, and reused the itable > address, while new graph is created and fuse_graph_setup called due to > options changed such like performance.write_behind. We cannot use the inodes from old graph, because per xlator state from xlators of new graph is not present in them. So, we build new inodes and let interested xlators in the new graph store the state. This is done by sending a lookup call for each nodeid the kernel is using (on demand basis. We don't do this if kernel doesn't do an operation on the inodes it has cached). The other possible approach is to build the state in old inodes instead of creating new inodes (that way we don't change nodeid). But how scalable is that approach is the question (what happens if there are too many graph switches and functions like inode_ctx_get/set slow down due to too many xlator states). Also, this is just a thought and I've to think more on this. Even if we do this, it will be a more disruptive change and needs more testing to be accepted as a solution. > 3) is there some way let ino(which showed in Annex 1) not changed due to > graph rebuild, I mean not through address, such like the mapping of the > gfid, then even if the itable rebuild, glusterfs have the same nodeid which > stored in kernel if solution 1 and solution 2 can't be used? Using plain gfids as nodeids would be the best solution. But kernel allows only 64bit nodeids and gfids are bigger. Mapping requires one more data-structure (say hash-table etc), that brings more overhead and I guess we wanted to keep things simple (by just using address of inode object as nodeid - Annex 1). > 4) any other solutions? I can't think of an easy simple solution now. Will update if I can find something. > > I would like to know your opinion about this issue and our solutions, and > your reply will be appreciated! > > Annex 1: inode mapping with the address of node stored in itable. > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse) > { > inode_t *inode = NULL; > xlator_t *active_subvol = NULL; > > if (ino == 1) { > active_subvol = fuse_active_subvol (fuse); > if (active_subvol) > inode = active_subvol->itable->root; > } else { > inode = (inode_t *) (unsigned long) ino; > inode_ref (inode); > } > > return inode; > } > > Annex 2: kernel code changed for this issue, which mean for some kernel > there is no un-expect umount issue, but has DOS-attack issue, and while the > latest kernel will have the un-expect umount issue due to fuse_graph_setup > rebuild. > 1) no un-expected umount issue but has DOS-attack issue. > kernel is doing the following in fs/fuse/dir.c: > > if (outarg.nodeid != get_node_id(inode)) { > fuse_queue_forget(fc, forget, outarg.nodeid, 1); > goto invalid; > } > > invalid: > ret = 0; > if (check_submounts_and_drop(entry) != 0) > ret = 1; > goto out; > } > > > The above check_submounts_and_drop() call was added in September 2013: > > commit 46ea1562da792a94ee8391f7aed882eb7771e18c > Author: Anand Avati <avati@xxxxxxxxxx> > Date: Thu Sep 5 11:44:44 2013 +0200 > > fuse: drop dentry on failed revalidate > > Drop a subtree when we find that it has moved or been delated. This can > be > done as long as there are no submounts under this location. > > If the directory was moved and we come across the same directory in a > future lookup it will be reconnected by d_materialise_unique(). > > Signed-off-by: Anand Avati <avati@xxxxxxxxxx> > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxx> > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index 25c6cfe98801..0e6961aae6c0 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -259,6 +259,8 @@ out: > > invalid: > ret = 0; > + if (check_submounts_and_drop(entry) != 0) > + ret = 1; > goto out; > } > > Ie. this version of fuse specifically avoided invalidating the dentry if > there were directories mounted under it. > (The check_submounts_and_drop() function is part of general vfs code, not > fuse specific.) > > Then, in October 2013 there was a wider kernel change which caused invalid > mounts to be deleted more aggressively > for security reasons (elimination of certain kind of DOS attacks): > > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe > Author: Eric W. Biederman <ebiederman@xxxxxxxxxxx> > Date: Tue Oct 1 18:33:48 2013 -0700 > > vfs: Lazily remove mounts on unlinked files and directories. > > With the introduction of mount namespaces and bind mounts it became > possible to access files and directories that on some paths are mount > points but are not mount points on other paths. It is very confusing > when rm -rf somedir returns -EBUSY simply because somedir is mounted > somewhere else. With the addition of user namespaces allowing > unprivileged mounts this condition has gone from annoying to allowing > a DOS attack on other users in the system. > > The possibility for mischief is removed by updating the vfs to support > rename, unlink and rmdir on a dentry that is a mountpoint and by > lazily unmounting mountpoints on deleted dentries. > > > The above commit changed check_submounts_and_drop() function so that it will > not return non-zero error status any more if the entry is a mount point (or > has mount points under it). > Instead it simply unmounts those entries unconditionally. > > > Finally, that call was removed from fuse code because it was not doing > anything anymore: > > commit 9b053f3207e8887fed88162a339fdd4001abcdb2 > Author: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > Date: Thu Feb 13 09:34:30 2014 -0800 > > vfs: Remove unnecessary calls of check_submounts_and_drop > > Now that check_submounts_and_drop can not fail and is called from > d_invalidate there is no longer a need to call check_submounts_and_drom > from filesystem d_revalidate methods so remove it. > > Reviewed-by: Miklos Szeredi <miklos@xxxxxxxxxx> > Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > index de1d84af9f7c..820efd74ca9f 100644 > --- a/fs/fuse/dir.c > +++ b/fs/fuse/dir.c > @@ -274,9 +274,6 @@ out: > > > 2) has un-expected umount issue. > invalid: > ret = 0; > - > - if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0) > - ret = 1; > goto out; > } > > Later, the check_submounts_and_drop() function was merged to d_invalidate() > and ceased to exist. > > So, the unmounting does not happen in CentOS because it is using quite old > and more relaxed kernel where filesystem code lets the mount exists even if > glusterfs node id has changed. > But newer kernels released during the last 3 years are using stricter mount > handling policy which causes the unmounting (and most likely will not go > back). > > > Best regards, > George > > > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://lists.gluster.org/mailman/listinfo/gluster-devel