On Sun, Aug 04, 2024 at 10:13:22PM +0100, Al Viro wrote: > On Sun, Aug 04, 2024 at 08:18:14AM -0700, Linus Torvalds wrote: > > On Sat, 3 Aug 2024 at 20:47, Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote: > > > > > > bitmap_copy_and_extend(nfdt->open_fds, ofdt->open_fds, > > > copy_words * BITS_PER_LONG, nwords * BITS_PER_LONG); > > > > Ok, thinking about it, I like this interface, since it looks like the > > compiler would see that it's in BITS_PER_LONG chunks if we inline it > > and decide it's important. > > > > So make it so. > > FWIW, what I'm testing right now is the patch below; it does fix the reproducer > and it seems to be having no problems with xfstests so far. Needs profiling; > again, no visible slowdowns on xfstests, but... Seems to work, no visible slowdowns. Pushed into #fixes. BTW, speaking of fdtable, the longer I look at it, the more I'm tempted to try and kill it off completely. Look: ->fdt->max_fds is non-decreasing ->fdt->full_fds_bits only accessed under ->files_lock ->fdt->open_fds is sometimes used with rcu_read_lock() alone, but such users are very few - it's bitmap_weight(fdt->open_fds, fdt->max_fds); in proc_readfd_count() (stat on /proc/*/fd) and max_select_fds(); the latter is checking if anything in the fd_set_bits passed to select(2) is not opened, in addition to looking for the maximal descriptor passed to select(2) in that. Then we do fdget() on all of them during the main loop of select(). Sure, we want -EBADF if it hadn't been opened at all and we want EPOLLNVAL on subsequent passes, but that could've been handled easily in the main loop itself. As for the bitmap_weight(), I seriously suspect that keeping the count (all updates are under ->files_lock) might be cheap enough, and IIRC we already had cgroup (or was it bpf?) folks trying to get that count in really scary ways... ->fdt->close_on_exec has two places accessing it with rcu_read_lock() alone (F_GETFD and alloc_bprm()). In both cases we have already done fdget() on the same descrpiptor. So... do we really need that indirect? The problem would be seeing ->max_fds update before that of the array pointers. Smells like that should be doable with barrier(s) - and not many, at that... Note that if we coallocate the file references' array and bitmaps, we could e.g. slap rcu_head over the beginning of ->open_fds bitmap after the sucker's gone from files_struct and do RCU-delayed freeing on that. Am I missing something obvious here?