Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes: > > > You obviously are aware of this but I'm just spelling it out. Iirc, > > > there will practically only ever be one light refcount per file. > > > > > > For a light refcount to be used we know that the file descriptor table > > > isn't shared with any other task. So there are no threads that could > > > concurrently access the file descriptor table. We also know that the > > > file descriptor table cannot become shared while we're in system call > > > context because the caller can't create new threads and they can't > > > unshare the file descriptor table. > > > > > > So there's only one fdget() caller (Yes, they could call fdget() > > > multiple times and then have to do fdput() multiple times but that's a > > > level of weirdness that we don't need to worry about.). > > > > Hmm. Is it not the case that different processes with different file > > descriptor tables could reference the same underlying `struct file` and > > both use light refcounts to do so, as long as each fd table is not > > shared? So there could be multiple light refcounts to the same `struct > > file` at the same time on different threads. > > Relevant rules: > > * Each file pointer in any descriptor table contributes to refcount > of file. > > * All assignments to task->files are done by the task itself or, > during task creation, by its parent The latter happens before the task > runs for the first time. The former is done with task_lock(current) > held. > > * current->files is always stable. The object it points to > is guaranteed to stay alive at least until you explicitly change > current->files. > * task->files is stable while you are holding task_lock(task). > The object it points to is guaranteed to stay alive until you release > task_lock(task). > * task->files MAY be fetched (racily) without either of the > above, but it should not be dereferenced - the memory may be freed > and reused right after you've fetched the pointer. > > * descriptor tables are refcounted by table->count. > * descriptor table is created with ->count equal to 1 and > destroyed when its ->count reaches 0. > * each task with task->files == table contributes to table->count. > * before the task dies, its ->files becomes NULL (see exit_files()). > * when task is born (see copy_process() and copy_files())) the parent > is responsible for setting the value of task->files and making sure that > refcounts are correct; that's the only case where one is allowed to acquire > an extra reference to existing table (handling of clone(2) with COPY_FILES). > > * the only descriptor table one may modify is that pointed to > by current->files. Any access to other threads' descriptor tables is > read-only. > > * struct fd is fundamentally thread-local. It should never be > passed around, put into shared data structures, etc. > > * if you have done fdget(N), the matching fdput() MUST be done > before the caller modifies the Nth slot of its descriptor table, > spawns children that would share the descriptor table. > > * fdget() MAY borrow a reference from caller's descriptor table. > That can be done if current->files->count is equal to 1. > In that case we can be certain that the file reference we fetched from > our descriptor table will remain unchanged (and thus contributing to refcount > of file) until fdput(). Indeed, > + at the time of fdget() no other thread has task->files pointing > to our table (otherwise ->count would be greater than 1). > + our thread will remain the sole owner of descriptor table at > least until fdput(). Indeed, the first additional thread with task->files > pointing to our table would have to have been spawned by us and we are > forbidden to do that (rules for fdget() use) > + no other thread could modify our descriptor table (they would > have to share it first). > + we are allowed to modify our table, but we are forbidden to touch > the slot we'd copied from (rules for fdget() use). > > In other words, if current->files->count is equal to 1 at fdget() time > we can skip incrementing refcount. Matching fdput() would need to > skip decrement, of course. Note that we must record that (borrowed > vs. cloned) in struct fd - the condition cannot be rechecked at fdput() > time, since the table that had been shared at fdget() time might no longer > be shared by the time of fdput(). This is great! It matches my understanding. I didn't know the details about current->files and task->files. You should copy this to the kernel documentation somewhere. :) > > And this does *not* apply to `fdget_pos`, which checks the refcount of > > the `struct file` instead of the refcount of the fd table. > > False. fdget_pos() is identical to fdget() as far as file refcount > handling goes. The part that is different is that grabbing ->f_pos_lock > is sensitive to file refcount in some cases. This is orthogonal to > "does this struct fd contribute to file refcount". Sorry, I see now that I didn't phrase that quite right. What I meant is that there are ways of sharing a `struct file` reference during an fdget scope that are not dangerous, but where it *would be* dangerous if it was an fdget_pos scope instead. Specifically, the reason they are dangerous is that they can lead to a data race on the file position if the fdget_pos scope did not take the f_pos_lock mutex. For example, during an `fdget(N)` scope, you can always do a `get_file` and then send it to another process and `fd_install` it into that other process. There's no way that this could result in the deletion of the Nth entry of `current->files`. However, during an `fdget_pos(N)` scope, then it is *not* the case that it's always okay to send a `get_file` reference to another thread and `fd_install` it. Because after the remote process returns from the syscall in which we `fd_install`ed the file, the remote process could proceed to call another syscall that in turn modifies the file position. And if the original `fdget_pos(N)` scope modifies the file position after sending the `get_file` reference, then that could be a data race on f_pos. > Again, "light" references are tied to thread; they can only be created > if we are guaranteed that descriptor table's slot they came from will > remain unchanged for as long as the reference is used. > > And yes, there may be several light references to the same file - both > in different processes that do not share descriptor table *and* in the > same thread, if e.g. sendfile(in_fd, out_fd, ...) is called with > in_fd == out_fd. Thanks for confirming this! I hope this reply along with my reply to Christian Brauner also addresses your other thread. Let me know if it doesn't. Alice