On Thu, 2015-07-09 at 19:17 +0800, Ian Kent wrote: > On Wed, 2015-07-08 at 02:42 +0100, Al Viro wrote: > > Normally opening a file, unlinking it and then closing will have > > the inode freed upon close() (provided that it's not otherwise busy and > > has no remaining links, of course). However, there's one case where that > > does *not* happen. Namely, if you open it by fhandle with cold dcache, > > then unlink() and close(). > > > > In normal case you get d_delete() in unlink(2) notice that dentry > > is busy and unhash it; on the final dput() it will be forcibly evicted from > > dcache, triggering iput() and inode removal. In this case, though, we end > > up with *two* dentries - disconnected (created by open-by-fhandle) and > > regular one (used by unlink()). The latter will have its reference to inode > > dropped just fine, but the former will not - it's considered hashed (it > > is on the ->s_anon list), so it will stay around until the memory pressure > > will finally do it in. As the result, we have the final iput() delayed > > indefinitely. It's trivial to reproduce - > > > > #define _GNU_SOURCE > > #include <sys/types.h> > > #include <sys/stat.h> > > #include <fcntl.h> > > > > void flush_dcache(void) > > { > > system("mount -o remount,rw /"); > > } > > > > static char buf[20 * 1024 * 1024]; > > > > main() > > { > > int fd; > > union { > > struct file_handle f; > > char buf[MAX_HANDLE_SZ]; > > } x; > > int m; > > > > x.f.handle_bytes = sizeof(x); > > chdir("/root"); > > mkdir("foo", 0700); > > fd = open("foo/bar", O_CREAT | O_RDWR, 0600); > > close(fd); > > name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); > > flush_dcache(); > > fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); > > unlink("foo/bar"); > > write(fd, buf, sizeof(buf)); > > system("df ."); /* 20Mb eaten */ > > close(fd); > > system("df ."); /* should've freed those 20Mb */ > > flush_dcache(); > > system("df ."); /* should be the same as #2 */ > > } > > > > will spit out something like > > Filesystem 1K-blocks Used Available Use% Mounted on > > /dev/root 322023 303843 1131 100% / > > Filesystem 1K-blocks Used Available Use% Mounted on > > /dev/root 322023 303843 1131 100% / > > Filesystem 1K-blocks Used Available Use% Mounted on > > /dev/root 322023 283282 21692 93% / > > - inode gets freed only when dentry is finally evicted (here we trigger > > than by remount; normally it would've happened in response to memory > > pressure hell knows when). > > > > IMO it's a bug. Between the close() and final flush_dcache() the file has > > no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot > > output, and yet it's still not freed. I'm not saying that it's realistically > > exploitable (albeit with nfsd it might be), but it's a very unpleasant > > self-LART. > > > > FWIW, my prefered fix would be simply to have the final dput() treat > > disconnected dentries as "kill on sight"; checking for i_nlink of the > > inode, as Bruce suggested several years ago, will *not* work, simply > > because having another link to that file and unlinking it after close > > will reproduce the situation - disconnected dentry sticks around in > > dcache past its final dput() and past the last unlink() of our file. > > Theoretically it might cause an overhead for nfsd (no_subtree_check v3 > > export might see more d_alloc()/d_free(); icache retention will still > > prevent constant rereading the inode from disk). _IF_ that proves to > > be noticable, we might need to come up with something more elaborate > > (e.g. have unlink() and rename() kick disconnected aliases out if the link > > count has reached 0), but it's more complex and needs careful ananlysis > > of correctness - we need to prove that there's no way to miss the link > > count reaching 0. I would prefer to treat all disconnected as unhashed > > for dcache retention purposes - it's simpler and less brittle. Comments? > > I mean something like this: > > Al, help me out here, I'm struggling to understand where these dentrys > come from (apart from your reproducer). > > For example, on the heavily patched 2.6.32 kernel where this was first > seen the problem dentry is annoymous, refcount 0, and unhashed. > > But the dentrys that will most likely face summary execution will be > hashed, such as was the case on that 2.6.32 kernel at dput(). > > Doesn't that mean that something dropped the dentry after the dput(), > that will now also free the dentry, that took the refcount to 0? Oh wait, think I get it now ... perhaps it's prune_one_dentry() doing it ... Ian -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html