On Wed, Jul 08, 2015 at 02:42:38AM +0100, Al Viro wrote: > Normally opening a file, unlinking it and then closing will have > the inode freed upon close() (provided that it's not otherwise busy and > has no remaining links, of course). However, there's one case where that > does *not* happen. Namely, if you open it by fhandle with cold dcache, > then unlink() and close(). > > In normal case you get d_delete() in unlink(2) notice that dentry > is busy and unhash it; on the final dput() it will be forcibly evicted from > dcache, triggering iput() and inode removal. In this case, though, we end > up with *two* dentries - disconnected (created by open-by-fhandle) and > regular one (used by unlink()). The latter will have its reference to inode > dropped just fine, but the former will not - it's considered hashed (it > is on the ->s_anon list), so it will stay around until the memory pressure > will finally do it in. As the result, we have the final iput() delayed > indefinitely. It's trivial to reproduce - > > #define _GNU_SOURCE > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > > void flush_dcache(void) > { > system("mount -o remount,rw /"); > } > > static char buf[20 * 1024 * 1024]; > > main() > { > int fd; > union { > struct file_handle f; > char buf[MAX_HANDLE_SZ]; > } x; > int m; > > x.f.handle_bytes = sizeof(x); > chdir("/root"); > mkdir("foo", 0700); > fd = open("foo/bar", O_CREAT | O_RDWR, 0600); > close(fd); > name_to_handle_at(AT_FDCWD, "foo/bar", &x.f, &m, 0); > flush_dcache(); > fd = open_by_handle_at(AT_FDCWD, &x.f, O_RDWR); > unlink("foo/bar"); > write(fd, buf, sizeof(buf)); > system("df ."); /* 20Mb eaten */ > close(fd); > system("df ."); /* should've freed those 20Mb */ > flush_dcache(); > system("df ."); /* should be the same as #2 */ > } > > will spit out something like > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/root 322023 303843 1131 100% / > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/root 322023 303843 1131 100% / > Filesystem 1K-blocks Used Available Use% Mounted on > /dev/root 322023 283282 21692 93% / > - inode gets freed only when dentry is finally evicted (here we trigger > than by remount; normally it would've happened in response to memory > pressure hell knows when). > > IMO it's a bug. Definitely agreed. > Between the close() and final flush_dcache() the file has > no surviving links, is *not* busy, won't show up in fuser/lsof/whatnot > output, and yet it's still not freed. I'm not saying that it's realistically > exploitable (albeit with nfsd it might be) I'd be surprised if it doesn't happen. This, for example, is a "space on nfs export not freed when I expected it after an unlink" report that could easily have the same cause: https://www.redhat.com/archives/linux-cluster/2015-May/msg00000.html (Not sure, I'll get back to them and see if they can confirm.) > but it's a very unpleasant self-LART. > FWIW, my prefered fix would be simply to have the final dput() treat > disconnected dentries as "kill on sight"; checking for i_nlink of the > inode, as Bruce suggested several years ago, will *not* work, simply > because having another link to that file and unlinking it after close > will reproduce the situation - disconnected dentry sticks around in > dcache past its final dput() and past the last unlink() of our file. > Theoretically it might cause an overhead for nfsd (no_subtree_check v3 > export might see more d_alloc()/d_free(); icache retention will still > prevent constant rereading the inode from disk). _IF_ that proves to > be noticable, we might need to come up with something more elaborate > (e.g. have unlink() and rename() kick disconnected aliases out if the link > count has reached 0), but it's more complex and needs careful ananlysis > of correctness - we need to prove that there's no way to miss the link > count reaching 0. I would prefer to treat all disconnected as unhashed > for dcache retention purposes - it's simpler and less brittle. Comments? > I mean something like this: ACK from me. --b. > > diff --git a/fs/dcache.c b/fs/dcache.c > index 7a3f3e5..5c8ea15 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -642,7 +642,7 @@ static inline bool fast_dput(struct dentry *dentry) > > /* > * If we have a d_op->d_delete() operation, we sould not > - * let the dentry count go to zero, so use "put__or_lock". > + * let the dentry count go to zero, so use "put_or_lock". > */ > if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) > return lockref_put_or_lock(&dentry->d_lockref); > @@ -697,7 +697,7 @@ static inline bool fast_dput(struct dentry *dentry) > */ > smp_rmb(); > d_flags = ACCESS_ONCE(dentry->d_flags); > - d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST; > + d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST | DCACHE_DISCONNECTED; > > /* Nothing to do? Dropping the reference was all we needed? */ > if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry)) > @@ -776,6 +776,9 @@ repeat: > if (unlikely(d_unhashed(dentry))) > goto kill_it; > > + if (unlikely(dentry->d_flags & DCACHE_DISCONNECTED)) > + goto kill_it; > + > if (unlikely(dentry->d_flags & DCACHE_OP_DELETE)) { > if (dentry->d_op->d_delete(dentry)) > goto kill_it; -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html