Hi, Csaba, Could you please give some comments for this issues? Thanks & Best Regards, George -----Original Message----- From: Raghavendra Gowdappa [mailto:rgowdapp@xxxxxxxxxx] Sent: Monday, March 20, 2017 9:35 PM To: Lian, George (Nokia - CN/Hangzhou) <george.lian@xxxxxxxxx> Cc: Zhang, Bingxuan (Nokia - CN/Hangzhou) <bingxuan.zhang@xxxxxxxxx>; Gluster-devel@xxxxxxxxxxx; Venetjoki, Tero (Nokia - FI/Espoo) <tero.venetjoki@xxxxxxxxx>; Zhou, Cynthia (Nokia - CN/Hangzhou) <cynthia.zhou@xxxxxxxxx>; Csaba Henk <chenk@xxxxxxxxxx> Subject: Re: Nodeid changed due to write-behind option chagned online will lead to un-expected umount by kernel +csaba ----- Original Message ----- > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > To: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, Gluster-devel@xxxxxxxxxxx, "Tero Venetjoki > (Nokia - FI/Espoo)" <tero.venetjoki@xxxxxxxxx>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx> > Sent: Monday, March 20, 2017 7:02:50 PM > Subject: Re: Nodeid changed due to write-behind option chagned online will lead to un-expected umount > by kernel > > > > ----- Original Message ----- > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx> > > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, Gluster-devel@xxxxxxxxxxx > > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, > > "Cynthia Zhou (Nokia - CN/Hangzhou)" > > <cynthia.zhou@xxxxxxxxx>, "Tero Venetjoki (Nokia - FI/Espoo)" > > <tero.venetjoki@xxxxxxxxx> > > Sent: Monday, March 20, 2017 8:44:30 AM > > Subject: Nodeid changed due to write-behind option chagned online will lead > > to un-expected umount by kernel > > > > > > Hi, GlusterFS expert, > > > > In our latest test, We found an issue potentially related to glusterfs. > > When > > I execute “gluster volume set <volume> performance.write-behind on/off”, > > some bind mounts will get lost, this issue is permanent! > > Test steps: > > a) mkdir -p /mnt/log/node1/test1/test2; mkdir -p /mnt/log/node2/test1/test2 > > b) mkdir -p /mnt/test1/test2 > > c) mount –bind /mnt/log/node1/test1 /mnt/test1 > > d) mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2 > > e) mount | cut -d " " -f 3|xargs stat > > f) gluster volume set log performance.write-behind on/off check > > /mnt/test1/test2 , could found this bind mount get lost > > > > We’ve consulted the linux kernel side guys, their explanations for this > > “bind > > mount lost” issue is when kernel side do lookup or stat(), it will find the > > nodeid has changed and trigger fuse_dentry_revalidate, and > > fuse_dentry_relivalidate() fails for some dentry and invalidates it which > > leads to unmounting. > > > > 88.912262 | 0) stat-437 | | > > fuse_dentry_revalidate [fuse]() { > > <...> > > 88.912264 | 0) stat-437 | | > > fuse_simple_request [fuse]() { > > <...> > > 88.921383 | 0) stat-437 | # 9119.255 us | } /* > > fuse_simple_request [fuse] */ > > 88.921383 | 0) stat-437 | 0.093 us | > > dput(); > > 88.921384 | 0) stat-437 | | > > fuse_queue_forget [fuse]() { > > <...> > > 88.921427 | 0) stat-437 | + 42.737 us | } > > 88.921427 | 0) stat-437 | # 9164.967 us | } /* > > fuse_dentry_revalidate [fuse] */ > > 88.921427 | 0) stat-437 | | > > d_invalidate() { > > 88.921427 | 0) stat-437 | 0.048 us | > > _raw_spin_lock(); > > 88.921428 | 0) stat-437 | | > > d_walk() { > > 88.921428 | 0) stat-437 | 0.040 us | > > _raw_spin_lock(); > > 88.921428 | 0) stat-437 | 0.046 us | > > detach_and_collect(); > > 88.921429 | 0) stat-437 | 0.723 us | } > > 88.921429 | 0) stat-437 | | > > __detach_mounts() { > > > > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the > > failure is clearly caused by (outarg.nodeid != get_node_id(inode)) check: > > > > 7078187a795f8 (Miklos Szeredi 2014-12-12 09:49:05 +0100 230) > > if (!ret) { > > 6314efee3cfee (Miklos Szeredi 2013-10-01 16:41:22 +0200 231) > > fi = get_fuse_inode(inode); > > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 232) > > if (outarg.nodeid != get_node_id(inode)) { > > 07e77dca8a1f1 (Miklos Szeredi 2010-12-07 20:16:56 +0100 233) > > fuse_queue_forget(fc, forget, outarg.nodeid, > > 1); > > e2a6b95236eba (Miklos Szeredi 2013-09-05 11:44:43 +0200 234) > > goto invalid; > > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 235) > > } > > > > The latest kernel is doing the following in fs/fuse/dir.c: > > > > if (outarg.nodeid != get_node_id(inode)) { > > fuse_queue_forget(fc, forget, outarg.nodeid, 1); > > goto invalid; > > } > > > > invalid: > > ret = 0; > > goto out; > > } > > > > When We check the glusterfs code, I find that when execute command “gluster > > volume set <volume> performance.write-behind on/off”, the glusterfs client > > side graph will be re-built. And all xlators will be re-initialized , > > So I am wondering if this will lead to node id change and cause kernel side > > do the umount, because I find in function fuse_graph_setup, inode_table_new > > is called, does it mean all inode will be changed? > > Yes nodeids change after graph switch. Nice catch btw :). > > > If so, we need consider the following solutions > > 1) is there some way to notify kernel side in graceful way so that kernel > > side can do update the nodeid which stored before? > > AFAIK, there is no such way. > > > 2) is there some way let itable only initialized once, that's mean, we > > only > > initialized while fuse_graph_setup is first called, and reused the itable > > address, while new graph is created and fuse_graph_setup called due to > > options changed such like performance.write_behind. > > We cannot use the inodes from old graph, because per xlator state from > xlators of new graph is not present in them. So, we build new inodes and let > interested xlators in the new graph store the state. This is done by sending > a lookup call for each nodeid the kernel is using (on demand basis. We don't > do this if kernel doesn't do an operation on the inodes it has cached). > > The other possible approach is to build the state in old inodes instead of > creating new inodes (that way we don't change nodeid). But how scalable is > that approach is the question (what happens if there are too many graph > switches and functions like inode_ctx_get/set slow down due to too many > xlator states). Also, this is just a thought and I've to think more on this. > Even if we do this, it will be a more disruptive change and needs more > testing to be accepted as a solution. > > > 3) is there some way let ino(which showed in Annex 1) not changed due to > > graph rebuild, I mean not through address, such like the mapping of the > > gfid, then even if the itable rebuild, glusterfs have the same nodeid > > which > > stored in kernel if solution 1 and solution 2 can't be used? > > Using plain gfids as nodeids would be the best solution. But kernel allows > only 64bit nodeids and gfids are bigger. Mapping requires one more > data-structure (say hash-table etc), that brings more overhead and I guess > we wanted to keep things simple (by just using address of inode object as > nodeid - Annex 1). > > > 4) any other solutions? > > I can't think of an easy simple solution now. Will update if I can find > something. > > > > > I would like to know your opinion about this issue and our solutions, and > > your reply will be appreciated! > > > > Annex 1: inode mapping with the address of node stored in itable. > > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse) > > { > > inode_t *inode = NULL; > > xlator_t *active_subvol = NULL; > > > > if (ino == 1) { > > active_subvol = fuse_active_subvol (fuse); > > if (active_subvol) > > inode = active_subvol->itable->root; > > } else { > > inode = (inode_t *) (unsigned long) ino; > > inode_ref (inode); > > } > > > > return inode; > > } > > > > Annex 2: kernel code changed for this issue, which mean for some kernel > > there is no un-expect umount issue, but has DOS-attack issue, and while the > > latest kernel will have the un-expect umount issue due to fuse_graph_setup > > rebuild. > > 1) no un-expected umount issue but has DOS-attack issue. > > kernel is doing the following in fs/fuse/dir.c: > > > > if (outarg.nodeid != get_node_id(inode)) { > > fuse_queue_forget(fc, forget, outarg.nodeid, 1); > > goto invalid; > > } > > > > invalid: > > ret = 0; > > if (check_submounts_and_drop(entry) != 0) > > ret = 1; > > goto out; > > } > > > > > > The above check_submounts_and_drop() call was added in September 2013: > > > > commit 46ea1562da792a94ee8391f7aed882eb7771e18c > > Author: Anand Avati <avati@xxxxxxxxxx> > > Date: Thu Sep 5 11:44:44 2013 +0200 > > > > fuse: drop dentry on failed revalidate > > > > Drop a subtree when we find that it has moved or been delated. This > > can > > be > > done as long as there are no submounts under this location. > > > > If the directory was moved and we come across the same directory in a > > future lookup it will be reconnected by d_materialise_unique(). > > > > Signed-off-by: Anand Avati <avati@xxxxxxxxxx> > > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxx> > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > index 25c6cfe98801..0e6961aae6c0 100644 > > --- a/fs/fuse/dir.c > > +++ b/fs/fuse/dir.c > > @@ -259,6 +259,8 @@ out: > > > > invalid: > > ret = 0; > > + if (check_submounts_and_drop(entry) != 0) > > + ret = 1; > > goto out; > > } > > > > Ie. this version of fuse specifically avoided invalidating the dentry if > > there were directories mounted under it. > > (The check_submounts_and_drop() function is part of general vfs code, not > > fuse specific.) > > > > Then, in October 2013 there was a wider kernel change which caused invalid > > mounts to be deleted more aggressively > > for security reasons (elimination of certain kind of DOS attacks): > > > > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe > > Author: Eric W. Biederman <ebiederman@xxxxxxxxxxx> > > Date: Tue Oct 1 18:33:48 2013 -0700 > > > > vfs: Lazily remove mounts on unlinked files and directories. > > > > With the introduction of mount namespaces and bind mounts it became > > possible to access files and directories that on some paths are mount > > points but are not mount points on other paths. It is very confusing > > when rm -rf somedir returns -EBUSY simply because somedir is mounted > > somewhere else. With the addition of user namespaces allowing > > unprivileged mounts this condition has gone from annoying to allowing > > a DOS attack on other users in the system. > > > > The possibility for mischief is removed by updating the vfs to support > > rename, unlink and rmdir on a dentry that is a mountpoint and by > > lazily unmounting mountpoints on deleted dentries. > > > > > > The above commit changed check_submounts_and_drop() function so that it > > will > > not return non-zero error status any more if the entry is a mount point (or > > has mount points under it). > > Instead it simply unmounts those entries unconditionally. > > > > > > Finally, that call was removed from fuse code because it was not doing > > anything anymore: > > > > commit 9b053f3207e8887fed88162a339fdd4001abcdb2 > > Author: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > > Date: Thu Feb 13 09:34:30 2014 -0800 > > > > vfs: Remove unnecessary calls of check_submounts_and_drop > > > > Now that check_submounts_and_drop can not fail and is called from > > d_invalidate there is no longer a need to call check_submounts_and_drom > > from filesystem d_revalidate methods so remove it. > > > > Reviewed-by: Miklos Szeredi <miklos@xxxxxxxxxx> > > Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > index de1d84af9f7c..820efd74ca9f 100644 > > --- a/fs/fuse/dir.c > > +++ b/fs/fuse/dir.c > > @@ -274,9 +274,6 @@ out: > > > > > > 2) has un-expected umount issue. > > invalid: > > ret = 0; > > - > > - if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != 0) > > - ret = 1; > > goto out; > > } > > > > Later, the check_submounts_and_drop() function was merged to d_invalidate() > > and ceased to exist. > > > > So, the unmounting does not happen in CentOS because it is using quite old > > and more relaxed kernel where filesystem code lets the mount exists even if > > glusterfs node id has changed. > > But newer kernels released during the last 3 years are using stricter mount > > handling policy which causes the unmounting (and most likely will not go > > back). > > > > > > Best regards, > > George > > > > > > > _______________________________________________ > Gluster-devel mailing list > Gluster-devel@xxxxxxxxxxx > http://lists.gluster.org/mailman/listinfo/gluster-devel _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://lists.gluster.org/mailman/listinfo/gluster-devel