On Wed, Sep 27, 2023 at 11:05:37AM -0700, Linus Torvalds wrote: > On Wed, 27 Sept 2023 at 10:56, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote: > > > > Comments in the patch explicitly mention dodgin RCU for the file object. > > Not the commit message,. and the comment is also actually pretty > obscure and only talks about the freeing part. > How about this: ================== cut here ================== vfs: shave work on failed file open Failed opens (mostly ENOENT) legitimately happen a lot, for example here are stats from stracing kernel build for few seconds (strace -fc make): % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ------------------ 0.76 0.076233 5 15040 3688 openat (this is tons of header files tried in different paths) In the common case of there being nothing to close (only the file object to free) there is a lot of overhead which can be avoided. This boils down to 2 items: 1. avoiding delegation of fput to task_work, see 021a160abf62 ("fs: use __fput_sync in close(2)" for more details on overhead) 2. avoiding freeing the file with RCU Benchmarked with will-it-scale with a custom testcase based on tests/open1.c, stuffed into tests/openneg.c: [snip] while (1) { int fd = open("/tmp/nonexistent", O_RDONLY); assert(fd == -1); (*iterations)++; } [/snip] Sapphire Rapids, openneg_processes -t 1 (ops/s): before: 1950013 after: 2914973 (+49%) file refcount is checked with an atomic cmpxchg as a safety belt against buggy consumers. Technically it is not necessary, but it happens to not be measurable due to several other atomics which immediately follow. Optmizing them away to make this atomic into a problem is left as an exercise for the reader. ================== cut here ================== Comment in v2 is: /* * Clean up after failing to open (e.g., open(2) returns with -ENOENT). * * This represents opportunities to shave on work in the common case of * FMODE_OPENED not being set: * 1. there is nothing to close, just the file object to free and consequently * no need to delegate to task_work * 2. as nobody else had seen the file then there is no need to delegate * freeing to RCU */ I don't see anything wrong with it as far as information goes. > > Well put_cred is called synchronously, but should this happen to be > > the last ref on them, they will get call_rcu(&cred->rcu, > > put_cred_rcu)'ed. > > Yes. But the way it's done in __fput() you end up potentially > RCU-delaying it twice. Odd. > > The reason we rcu-delay the 'struct file *' is because of the > __fget_files_rcu() games. > > But I don't see why the cred thing is there. > > Historical mistake? But it all looks a bit odd, and because of that it > worries me. > put_cred showed up in file_free_rcu in d76b0d9b2d87 ("CRED: Use creds in file structs"). Commit message does not claim any dependency on this being in an rcu callback already and it looks like it was done this way because this was the ony spot with kmem_cache_free(filp_cachep, f) -- you ensured put_cred was always called without inspecting any other places. If there is something magic going on here I don't see it, it definitely was not intended at least.