On a tangential note, any idea on how the change of nodeid affects caching (page, dentry, attribute cache etc) in VFS? Does it throw away the cache? If yes, after a graph switch, there might be performance drop (till the cache gets built again). It would also be worth exploring what are the other effects of a changed nodeid for a file/directory. ----- Original Message ----- > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > To: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx> > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, Gluster-devel@xxxxxxxxxxx, "Tero Venetjoki > (Nokia - FI/Espoo)" <tero.venetjoki@xxxxxxxxx>, "Cynthia Zhou (Nokia - CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx>, > "Csaba Henk" <chenk@xxxxxxxxxx> > Sent: Monday, March 20, 2017 7:05:08 PM > Subject: Re: Nodeid changed due to write-behind option chagned online will lead to un-expected umount > by kernel > > +csaba > > ----- Original Message ----- > > From: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx> > > To: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx> > > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, > > Gluster-devel@xxxxxxxxxxx, "Tero Venetjoki > > (Nokia - FI/Espoo)" <tero.venetjoki@xxxxxxxxx>, "Cynthia Zhou (Nokia - > > CN/Hangzhou)" <cynthia.zhou@xxxxxxxxx> > > Sent: Monday, March 20, 2017 7:02:50 PM > > Subject: Re: Nodeid changed due to write-behind option > > chagned online will lead to un-expected umount > > by kernel > > > > > > > > ----- Original Message ----- > > > From: "George Lian (Nokia - CN/Hangzhou)" <george.lian@xxxxxxxxx> > > > To: "Raghavendra Gowdappa" <rgowdapp@xxxxxxxxxx>, > > > Gluster-devel@xxxxxxxxxxx > > > Cc: "Bingxuan Zhang (Nokia - CN/Hangzhou)" <bingxuan.zhang@xxxxxxxxx>, > > > "Cynthia Zhou (Nokia - CN/Hangzhou)" > > > <cynthia.zhou@xxxxxxxxx>, "Tero Venetjoki (Nokia - FI/Espoo)" > > > <tero.venetjoki@xxxxxxxxx> > > > Sent: Monday, March 20, 2017 8:44:30 AM > > > Subject: Nodeid changed due to write-behind option chagned online will > > > lead > > > to un-expected umount by kernel > > > > > > > > > Hi, GlusterFS expert, > > > > > > In our latest test, We found an issue potentially related to glusterfs. > > > When > > > I execute “gluster volume set <volume> performance.write-behind on/off”, > > > some bind mounts will get lost, this issue is permanent! > > > Test steps: > > > a) mkdir -p /mnt/log/node1/test1/test2; mkdir -p > > > /mnt/log/node2/test1/test2 > > > b) mkdir -p /mnt/test1/test2 > > > c) mount –bind /mnt/log/node1/test1 /mnt/test1 > > > d) mount –bind /mnt/log/node2/test1/test2 /mnt/test1/test2 > > > e) mount | cut -d " " -f 3|xargs stat > > > f) gluster volume set log performance.write-behind on/off check > > > /mnt/test1/test2 , could found this bind mount get lost > > > > > > We’ve consulted the linux kernel side guys, their explanations for this > > > “bind > > > mount lost” issue is when kernel side do lookup or stat(), it will find > > > the > > > nodeid has changed and trigger fuse_dentry_revalidate, and > > > fuse_dentry_relivalidate() fails for some dentry and invalidates it which > > > leads to unmounting. > > > > > > 88.912262 | 0) stat-437 | | > > > fuse_dentry_revalidate [fuse]() { > > > <...> > > > 88.912264 | 0) stat-437 | | > > > fuse_simple_request [fuse]() { > > > <...> > > > 88.921383 | 0) stat-437 | # 9119.255 us | } > > > /* > > > fuse_simple_request [fuse] */ > > > 88.921383 | 0) stat-437 | 0.093 us | > > > dput(); > > > 88.921384 | 0) stat-437 | | > > > fuse_queue_forget [fuse]() { > > > <...> > > > 88.921427 | 0) stat-437 | + 42.737 us | } > > > 88.921427 | 0) stat-437 | # 9164.967 us | } /* > > > fuse_dentry_revalidate [fuse] */ > > > 88.921427 | 0) stat-437 | | > > > d_invalidate() { > > > 88.921427 | 0) stat-437 | 0.048 us | > > > _raw_spin_lock(); > > > 88.921428 | 0) stat-437 | | > > > d_walk() { > > > 88.921428 | 0) stat-437 | 0.040 us | > > > _raw_spin_lock(); > > > 88.921428 | 0) stat-437 | 0.046 us | > > > detach_and_collect(); > > > 88.921429 | 0) stat-437 | 0.723 us | } > > > 88.921429 | 0) stat-437 | | > > > __detach_mounts() { > > > > > > fuse_queue_forget() got called inside fuse_dentry_revalidate(), so the > > > failure is clearly caused by (outarg.nodeid != get_node_id(inode)) > > > check: > > > > > > 7078187a795f8 (Miklos Szeredi 2014-12-12 09:49:05 +0100 230) > > > if (!ret) { > > > 6314efee3cfee (Miklos Szeredi 2013-10-01 16:41:22 +0200 231) > > > fi = get_fuse_inode(inode); > > > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 232) > > > if (outarg.nodeid != get_node_id(inode)) { > > > 07e77dca8a1f1 (Miklos Szeredi 2010-12-07 20:16:56 +0100 233) > > > fuse_queue_forget(fc, forget, outarg.nodeid, > > > 1); > > > e2a6b95236eba (Miklos Szeredi 2013-09-05 11:44:43 +0200 234) > > > goto invalid; > > > 9e6268db496a2 (Miklos Szeredi 2005-09-09 13:10:29 -0700 235) > > > } > > > > > > The latest kernel is doing the following in fs/fuse/dir.c: > > > > > > if (outarg.nodeid != get_node_id(inode)) { > > > fuse_queue_forget(fc, forget, outarg.nodeid, 1); > > > goto invalid; > > > } > > > > > > invalid: > > > ret = 0; > > > goto out; > > > } > > > > > > When We check the glusterfs code, I find that when execute command > > > “gluster > > > volume set <volume> performance.write-behind on/off”, the glusterfs > > > client > > > side graph will be re-built. And all xlators will be re-initialized , > > > So I am wondering if this will lead to node id change and cause kernel > > > side > > > do the umount, because I find in function fuse_graph_setup, > > > inode_table_new > > > is called, does it mean all inode will be changed? > > > > Yes nodeids change after graph switch. Nice catch btw :). > > > > > If so, we need consider the following solutions > > > 1) is there some way to notify kernel side in graceful way so that > > > kernel > > > side can do update the nodeid which stored before? > > > > AFAIK, there is no such way. > > > > > 2) is there some way let itable only initialized once, that's mean, we > > > only > > > initialized while fuse_graph_setup is first called, and reused the > > > itable > > > address, while new graph is created and fuse_graph_setup called due to > > > options changed such like performance.write_behind. > > > > We cannot use the inodes from old graph, because per xlator state from > > xlators of new graph is not present in them. So, we build new inodes and > > let > > interested xlators in the new graph store the state. This is done by > > sending > > a lookup call for each nodeid the kernel is using (on demand basis. We > > don't > > do this if kernel doesn't do an operation on the inodes it has cached). > > > > The other possible approach is to build the state in old inodes instead of > > creating new inodes (that way we don't change nodeid). But how scalable is > > that approach is the question (what happens if there are too many graph > > switches and functions like inode_ctx_get/set slow down due to too many > > xlator states). Also, this is just a thought and I've to think more on > > this. > > Even if we do this, it will be a more disruptive change and needs more > > testing to be accepted as a solution. > > > > > 3) is there some way let ino(which showed in Annex 1) not changed due > > > to > > > graph rebuild, I mean not through address, such like the mapping of the > > > gfid, then even if the itable rebuild, glusterfs have the same nodeid > > > which > > > stored in kernel if solution 1 and solution 2 can't be used? > > > > Using plain gfids as nodeids would be the best solution. But kernel allows > > only 64bit nodeids and gfids are bigger. Mapping requires one more > > data-structure (say hash-table etc), that brings more overhead and I guess > > we wanted to keep things simple (by just using address of inode object as > > nodeid - Annex 1). > > > > > 4) any other solutions? > > > > I can't think of an easy simple solution now. Will update if I can find > > something. > > > > > > > > I would like to know your opinion about this issue and our solutions, and > > > your reply will be appreciated! > > > > > > Annex 1: inode mapping with the address of node stored in itable. > > > fuse_ino_to_inode (uint64_t ino, xlator_t *fuse) > > > { > > > inode_t *inode = NULL; > > > xlator_t *active_subvol = NULL; > > > > > > if (ino == 1) { > > > active_subvol = fuse_active_subvol (fuse); > > > if (active_subvol) > > > inode = active_subvol->itable->root; > > > } else { > > > inode = (inode_t *) (unsigned long) ino; > > > inode_ref (inode); > > > } > > > > > > return inode; > > > } > > > > > > Annex 2: kernel code changed for this issue, which mean for some kernel > > > there is no un-expect umount issue, but has DOS-attack issue, and while > > > the > > > latest kernel will have the un-expect umount issue due to > > > fuse_graph_setup > > > rebuild. > > > 1) no un-expected umount issue but has DOS-attack issue. > > > kernel is doing the following in fs/fuse/dir.c: > > > > > > if (outarg.nodeid != get_node_id(inode)) { > > > fuse_queue_forget(fc, forget, outarg.nodeid, 1); > > > goto invalid; > > > } > > > > > > invalid: > > > ret = 0; > > > if (check_submounts_and_drop(entry) != 0) > > > ret = 1; > > > goto out; > > > } > > > > > > > > > The above check_submounts_and_drop() call was added in September 2013: > > > > > > commit 46ea1562da792a94ee8391f7aed882eb7771e18c > > > Author: Anand Avati <avati@xxxxxxxxxx> > > > Date: Thu Sep 5 11:44:44 2013 +0200 > > > > > > fuse: drop dentry on failed revalidate > > > > > > Drop a subtree when we find that it has moved or been delated. This > > > can > > > be > > > done as long as there are no submounts under this location. > > > > > > If the directory was moved and we come across the same directory in a > > > future lookup it will be reconnected by d_materialise_unique(). > > > > > > Signed-off-by: Anand Avati <avati@xxxxxxxxxx> > > > Signed-off-by: Miklos Szeredi <mszeredi@xxxxxxx> > > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > > index 25c6cfe98801..0e6961aae6c0 100644 > > > --- a/fs/fuse/dir.c > > > +++ b/fs/fuse/dir.c > > > @@ -259,6 +259,8 @@ out: > > > > > > invalid: > > > ret = 0; > > > + if (check_submounts_and_drop(entry) != 0) > > > + ret = 1; > > > goto out; > > > } > > > > > > Ie. this version of fuse specifically avoided invalidating the dentry if > > > there were directories mounted under it. > > > (The check_submounts_and_drop() function is part of general vfs code, not > > > fuse specific.) > > > > > > Then, in October 2013 there was a wider kernel change which caused > > > invalid > > > mounts to be deleted more aggressively > > > for security reasons (elimination of certain kind of DOS attacks): > > > > > > commit 8ed936b5671bfb33d89bc60bdcc7cf0470ba52fe > > > Author: Eric W. Biederman <ebiederman@xxxxxxxxxxx> > > > Date: Tue Oct 1 18:33:48 2013 -0700 > > > > > > vfs: Lazily remove mounts on unlinked files and directories. > > > > > > With the introduction of mount namespaces and bind mounts it became > > > possible to access files and directories that on some paths are mount > > > points but are not mount points on other paths. It is very confusing > > > when rm -rf somedir returns -EBUSY simply because somedir is mounted > > > somewhere else. With the addition of user namespaces allowing > > > unprivileged mounts this condition has gone from annoying to allowing > > > a DOS attack on other users in the system. > > > > > > The possibility for mischief is removed by updating the vfs to > > > support > > > rename, unlink and rmdir on a dentry that is a mountpoint and by > > > lazily unmounting mountpoints on deleted dentries. > > > > > > > > > The above commit changed check_submounts_and_drop() function so that it > > > will > > > not return non-zero error status any more if the entry is a mount point > > > (or > > > has mount points under it). > > > Instead it simply unmounts those entries unconditionally. > > > > > > > > > Finally, that call was removed from fuse code because it was not doing > > > anything anymore: > > > > > > commit 9b053f3207e8887fed88162a339fdd4001abcdb2 > > > Author: Eric W. Biederman <ebiederm@xxxxxxxxxxxx> > > > Date: Thu Feb 13 09:34:30 2014 -0800 > > > > > > vfs: Remove unnecessary calls of check_submounts_and_drop > > > > > > Now that check_submounts_and_drop can not fail and is called from > > > d_invalidate there is no longer a need to call > > > check_submounts_and_drom > > > from filesystem d_revalidate methods so remove it. > > > > > > Reviewed-by: Miklos Szeredi <miklos@xxxxxxxxxx> > > > Signed-off-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx> > > > Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> > > > > > > diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c > > > index de1d84af9f7c..820efd74ca9f 100644 > > > --- a/fs/fuse/dir.c > > > +++ b/fs/fuse/dir.c > > > @@ -274,9 +274,6 @@ out: > > > > > > > > > 2) has un-expected umount issue. > > > invalid: > > > ret = 0; > > > - > > > - if (!(flags & LOOKUP_RCU) && check_submounts_and_drop(entry) != > > > 0) > > > - ret = 1; > > > goto out; > > > } > > > > > > Later, the check_submounts_and_drop() function was merged to > > > d_invalidate() > > > and ceased to exist. > > > > > > So, the unmounting does not happen in CentOS because it is using quite > > > old > > > and more relaxed kernel where filesystem code lets the mount exists even > > > if > > > glusterfs node id has changed. > > > But newer kernels released during the last 3 years are using stricter > > > mount > > > handling policy which causes the unmounting (and most likely will not go > > > back). > > > > > > > > > Best regards, > > > George > > > > > > > > > > > _______________________________________________ > > Gluster-devel mailing list > > Gluster-devel@xxxxxxxxxxx > > http://lists.gluster.org/mailman/listinfo/gluster-devel > _______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://lists.gluster.org/mailman/listinfo/gluster-devel