On Wed, Nov 29, 2017 at 12:24:55PM -0800, Linus Torvalds wrote: > Ugh. The inode freeing really is confusing and fairly involved, but > the last free *should* happen as part of the final dput() that is done > at the end of __fput(). Note that struct socket is coallocated with its inode. _Normally_ from sock_alloc() (and that's the case here, apparently), but in several cases it's embedded into another object. TUN and TAP - definitely, might have been other added. Those should never be passed to sock_release() at all. > So in __fput() calls into the > > if (file->f_op->release) > file->f_op->release(inode, file); > > then the inode should still be around, because the final ref won't be > done until later. And RCU simply shouldn't be an issue, because of > that reference count on the inode. > > So it smells like some reference counting went wrong. The socket inode > creation is a bit confusing, and then in "sock_release()" we do have > that > > if (!sock->file) { > iput(SOCK_INODE(sock)); > return; > } > sock->file = NULL; > > which *also* tries to free the inode. I'm not sure what the logic (and > what the locking) behind that code all is. If socket has never gone through sock_alloc_file(), sock_release() on it is called explicitly and frees the sucker. If it has been through sock_alloc_file(), we must not call sock_release() directly and freeing is done by iput() from final fput(). > What *is* the locking for "sock->file" anyway? Pretty much assign-once - zeroing it in the end of sock_release() is pure cosmetics (we'd damn better have no other references to that sucker left anywhere; there's still a reference to embedded inode, but that's it). FWIW, looking through the callers of sock_alloc_file()... we might be better off if it did sock_release() on failure. Then the calling conventions become "sock_alloc_file() means not calling sock_release() directly - either it'll be done by the final fput() on resulting file, or by sock_alloc_file() itself". Look: 1) in lustre: sock_filp = sock_alloc_file(sock, 0, NULL); if (IS_ERR(sock_filp)) { sock_release(sock); rc = PTR_ERR(sock_filp); goto out; } 2) in net/9p: file = sock_alloc_file(csocket, 0, NULL); if (IS_ERR(file)) { pr_err("%s (%d): failed to map fd\n", __func__, task_pid_nr(current)); sock_release(csocket); kfree(p); return PTR_ERR(file); } 3) in sctp: *newfile = sock_alloc_file(newsock, 0, NULL); if (IS_ERR(*newfile)) { put_unused_fd(retval); sock_release(newsock); retval = PTR_ERR(*newfile); *newfile = NULL; return retval; } 4) in accept4(): newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name); if (IS_ERR(newfile)) { err = PTR_ERR(newfile); put_unused_fd(newfd); sock_release(newsock); goto out_put; } 5) called in sock_map_fd(), and the sole caller is retval = sock_map_fd(sock, flags & (O_CLOEXEC | O_NONBLOCK)); if (retval < 0) goto out_release; ... out_release: sock_release(sock); return retval; (with no fallthrough or other goto into out_release) 6) the second caller in socketpair(): newfile2 = sock_alloc_file(sock2, flags, NULL); if (IS_ERR(newfile2)) { err = PTR_ERR(newfile2); goto out_fput_1; } ... out_fput_1: fput(newfile1); put_unused_fd(fd2); put_unused_fd(fd1); sock_release(sock2); goto out; (again, no fallthrough or other goto into out_fput_1) 7) the first caller in socketpair(): newfile1 = sock_alloc_file(sock1, flags, NULL); if (IS_ERR(newfile1)) { err = PTR_ERR(newfile1); goto out_put_unused_both; } ... out_put_unused_both: put_unused_fd(fd2); out_put_unused_1: put_unused_fd(fd1); out_release_both: sock_release(sock2); out_release_1: sock_release(sock1); out: return err; No fallthrough or goto either. Sure, we get a failure exit unshared, but AFAICS some reordering can simplify things quite a bit there. 8) kcm_clone(). Fucked in head - we allocate socket, then file, *THEN* sock, then attach sock to socket (already attached to file), then finally deign to initialize sock (already attached to socket, which is attached to file). And, surprise, surprise, failure exits are all wrong. Moreover, calling conventions are broken by design - after we'd put the damn file into descriptor table we return the pointer to sock to the caller. By that time it might have bloody well been destroyed by close(2) from another thread; good thing the caller doesn't use the damn thing. Unfortunately, it is doing this: if (!err) { if (copy_to_user((void __user *)arg, &info, sizeof(info))) { err = -EFAULT; sys_close(info.fd); } } which is also bogus - again, the descriptor might have been already closed by another thread, with another one put in its place, etc. The right way to do that is to do fd_install() _last_. After the last failure exit, i.e. after copy_to_user() in that case. KCM would've looked as a likely cause of that shit, if not for the fact that object had been created by socket(2). It definitely needs fixing, but that's not the cause of PITA here. Incidentally, grepping for sys_close() shows another piece of fun in net/netfilter/xt_bpf.c. Folks, ONCE DESCRIPTOR IS INSTALLED, THAT'S IT; THERE'S NO REMOVING IT ON FAILURE EXITS. sys_close() should never, ever be used that way. Sigh...