On Mon, May 27, 2024 at 04:03:56PM +0000, Alice Ryhl wrote: > > In other words, if current->files->count is equal to 1 at fdget() time > > we can skip incrementing refcount. Matching fdput() would need to > > skip decrement, of course. Note that we must record that (borrowed > > vs. cloned) in struct fd - the condition cannot be rechecked at fdput() > > time, since the table that had been shared at fdget() time might no longer > > be shared by the time of fdput(). > > This is great! It matches my understanding. I didn't know the details > about current->files and task->files. > > You should copy this to the kernel documentation somewhere. :) Probably, after it's turned into something more coherent - and after the description of struct fd scope rules is corrected ;-/ Correction in question: you _are_ allowed to move reference from descriptor table while in scope of struct fd; what you are not allowed is dropping that reference until the end of scope. That's what binder_do_fd_close() is about - binder_deferred_fd_close() is called in a struct fd scope (anything called from ->unlocked_ioctl() instances is). It *does* remove a struct file reference from descriptor table: twcb->file = file_close_fd(fd); moves that reference to twcb->file, and subsequent get_file(twcb->file); filp_close(twcb->file, current->files); completes detaching it from the table[*] and the reference itself is dropped via task_work, which is going to be executed on the way out to userland, definitely after we leave the scope of struct fd. Incidentally, I'm very tempted to unexport close_fd(); in addition to being a source of bugs when called from ioctl on user-supplied descriptor it encourages racy crap - just look at e.g. 1819200166ce "drm/amdkfd: Export DMABufs from KFD using GEM handles", where we call drm_gem_prime_handle_to_fd(), immediately followed by dmabuf = dma_buf_get(fd); close_fd(fd); dup2() from another thread with guessed descriptor number as target and you've got a problem... It's not a violation of fdget() use rules (it is called from ioctl, but descriptor is guaranteed to be different from the one passed to ioctl(2)), but it's still wrong. Would take some work, though... "Detaching from the table" bit also needs documenting, BTW. If you look at that thing, you'll see that current->files is converted to fl_owner_t, which is an opaque pointer. What happens is that dnotify and POSIX lock use current->files as opaque tags (->dn_owner and ->flc_owner resp.) and filp_close() (well, filp_flush() these days) needs to be called to purge all of those associated with given struct file and given tag value. That needs to be done between removal of file reference from table and destruction of the table itself and it guarantees that those opaque references won't outlive the table (more importantly, don't survive until a different files_struct instance is allocated at the same address). [*] NB: might make sense to export filp_flush(), since that's what this sequence boils down to. We really need a better identifier, though - "filp" part is a leftover from OSDI, AFAICT; that's a hungarism for "file pointer" and it makes no sense. file_flush() would be better, IMO - or flush_file(), for that matter.