On Tue, Sep 5, 2023 at 2:42 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> Userfaultfd is the type of file that doesn't need wake-all semantics: if
> there is a message enqueued (for either a fault address, or an event), we
> only need to wake up one service thread to handle it. Waking up more
> normally means a waste of cpu cycles. Besides that, and more importantly,
> that just doesn't scale.

Hi Peter,

I took a quick look over the series and didn't see anything objectionable.
I was planning to actually test the series out and then send out R-b's,
but it will take some additional time (next week).

In the meantime, I was curious about the use case. A design I've seen for
VM live migration is to have one thread reading events off the uffd, and
then have many threads actually resolving the fault events that come in
(e.g. fetching pages over the network, issuing UFFDIO_COPY or
UFFDIO_CONTINUE, or whatever); there is a rough sketch of what I mean
further down in this mail. In that design, since we only have a single
reader anyway, I don't think this series helps.

But I'm curious: do you have data indicating that more than one reader is
more performant overall? I suspect that with "enough" vCPUs it makes sense
to do so, but I don't have benchmark data yet telling me where that
tipping point is. OTOH, if one reader is plenty in ~all cases, optimizing
this path is less important.

> Andrea used to have one patch that made read() O(1) but it never hit
> upstream. This is my effort to try upstreaming that (which is a
> one-liner..); on top of that I also made poll() O(1) on wakeup (more or
> less bringing EPOLLEXCLUSIVE to poll()), with some tests showing the
> effect.
>
> To verify this, I added a test called uffd-perf (leveraging the
> refactored uffd selftest suite) that measures the messaging channel
> latencies on wakeups; the effect of the waitqueue optimizations shows up
> in the new test:
>
>   Constants: 40 uffd threads, on N_CPUS=40, memsize=512M
>   Units: milliseconds (to finish the test)
>
>   |-----------------+--------+-------+------------|
>   | test case       | before | after | diff (%)   |
>   |-----------------+--------+-------+------------|
>   | workers=8,poll  |   1762 |  1133 | -55.516328 |
>   | workers=8,read  |   1437 |   585 | -145.64103 |
>   | workers=16,poll |   1117 |  1097 | -1.8231541 |
>   | workers=16,read |   1159 |   759 | -52.700922 |
>   | workers=32,poll |   1001 |   973 | -2.8776978 |
>   | workers=32,read |    866 |   713 | -21.458626 |
>   |-----------------+--------+-------+------------|
>
> The more threads hanging on the fd_wqh queue, the bigger the difference
> shown in the numbers. "8 worker threads" is the worst case here because
> it means up to 40-8=32 threads can be hanging idle on the fd_wqh queue.
>
> In real life there can be more workers than this, but a small number of
> active worker threads will cause a similar effect.
>
> This is currently based on Andrew's mm-unstable branch, but it should
> apply to most not-so-old trees.
>
> Comments welcomed, thanks.
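
To make the single-reader design I mentioned above concrete, here is a
rough sketch of what I have in mind: one thread drains the uffd and hands
faults off to a pool of resolver threads. This is only a sketch under my
assumptions, not code from the series or from any real VMM: uffd creation,
UFFDIO_API/UFFDIO_REGISTER setup, error handling and queue bounds are all
omitted, and fetch_page() is a hypothetical stand-in for however the page
contents actually arrive (e.g. the network fetch during live migration).

/* Hypothetical sketch: one uffd reader thread, many fault resolvers. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define PAGE_SZ 4096UL          /* assumes 4K base pages */

struct fault_work {
        struct fault_work *next;
        uint64_t fault_addr;
};

static int uffd;                                /* shared, already registered */
static struct fault_work *work_head;            /* simple FIFO of pending faults */
static struct fault_work **work_tail = &work_head;
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t work_cond = PTHREAD_COND_INITIALIZER;

extern void fetch_page(uint64_t addr, void *buf);   /* hypothetical, e.g. network fetch */

/* The single reader: the only thread that ever sits on the uffd waitqueue. */
static void *uffd_reader(void *arg)
{
        struct uffd_msg msg;

        (void)arg;
        while (read(uffd, &msg, sizeof(msg)) == (ssize_t)sizeof(msg)) {
                if (msg.event != UFFD_EVENT_PAGEFAULT)
                        continue;       /* ignore non-fault events in this sketch */

                struct fault_work *w = malloc(sizeof(*w));
                w->next = NULL;
                w->fault_addr = msg.arg.pagefault.address & ~((uint64_t)PAGE_SZ - 1);

                pthread_mutex_lock(&work_lock);
                *work_tail = w;
                work_tail = &w->next;
                pthread_cond_signal(&work_cond);        /* wake one resolver */
                pthread_mutex_unlock(&work_lock);
        }
        return NULL;
}

/* Resolver workers: fetch the page contents, then install them with UFFDIO_COPY. */
static void *fault_resolver(void *arg)
{
        static __thread char buf[PAGE_SZ] __attribute__((aligned(4096)));

        (void)arg;
        for (;;) {
                pthread_mutex_lock(&work_lock);
                while (!work_head)
                        pthread_cond_wait(&work_cond, &work_lock);
                struct fault_work *w = work_head;
                work_head = w->next;
                if (!work_head)
                        work_tail = &work_head;
                pthread_mutex_unlock(&work_lock);

                fetch_page(w->fault_addr, buf);

                struct uffdio_copy copy = {
                        .dst = w->fault_addr,
                        .src = (uint64_t)(uintptr_t)buf,
                        .len = PAGE_SZ,
                        .mode = 0,
                };
                ioctl(uffd, UFFDIO_COPY, &copy);        /* wakes the faulting thread */
                free(w);
        }
        return NULL;
}

The point being: no matter how many resolver threads there are, only one
thread ever blocks on the uffd itself, so I wouldn't expect the wake-one
change to matter for this particular layout.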
>
> Andrea Arcangeli (1):
>   mm/userfaultfd: Make uffd read() wait event exclusive
>
> Peter Xu (6):
>   poll: Add a poll_flags for poll_queue_proc()
>   poll: POLL_ENQUEUE_EXCLUSIVE
>   fs/userfaultfd: Use exclusive waitqueue for poll()
>   selftests/mm: Replace uffd_read_mutex with a semaphore
>   selftests/mm: Create uffd_fault_thread_create|join()
>   selftests/mm: uffd perf test
>
>  drivers/vfio/virqfd.c                    |   4 +-
>  drivers/vhost/vhost.c                    |   2 +-
>  drivers/virt/acrn/irqfd.c                |   2 +-
>  fs/aio.c                                 |   2 +-
>  fs/eventpoll.c                           |   2 +-
>  fs/select.c                              |   9 +-
>  fs/userfaultfd.c                         |   8 +-
>  include/linux/poll.h                     |  25 ++-
>  io_uring/poll.c                          |   4 +-
>  mm/memcontrol.c                          |   4 +-
>  net/9p/trans_fd.c                        |   3 +-
>  tools/testing/selftests/mm/Makefile      |   2 +
>  tools/testing/selftests/mm/uffd-common.c |  65 +++++++
>  tools/testing/selftests/mm/uffd-common.h |   7 +
>  tools/testing/selftests/mm/uffd-perf.c   | 207 +++++++++++++++++++++++
>  tools/testing/selftests/mm/uffd-stress.c |  53 +-----
>  virt/kvm/eventfd.c                       |   2 +-
>  17 files changed, 337 insertions(+), 64 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/uffd-perf.c
>
> --
> 2.41.0
>
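
For contrast, my mental model of the multi-reader pattern this series
targets (and, if I'm reading it right, roughly what the workers=N,read
case in the new uffd-perf test exercises) is each worker thread blocking
on the shared uffd directly and resolving the fault itself, as sketched
below. Same caveats as the sketch above: setup and error handling are
omitted, uffd_worker() and fetch_page() are hypothetical names, and
whether the post-wakeup read() races end in EAGAIN depends on whether the
fd was opened with O_NONBLOCK.

/* Hypothetical sketch: N identical workers, each reads the uffd itself. */
#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

#define PAGE_SZ 4096UL          /* assumes 4K base pages */

extern int uffd;                                        /* shared, already registered */
extern void fetch_page(uint64_t addr, void *buf);       /* hypothetical */

static void *uffd_worker(void *arg)
{
        static __thread char buf[PAGE_SZ] __attribute__((aligned(4096)));
        struct uffd_msg msg;

        (void)arg;
        for (;;) {
                /* All N workers wait on the same fd; the series aims to wake
                 * only one of them per queued message on either path. */
                struct pollfd pfd = { .fd = uffd, .events = POLLIN };
                if (poll(&pfd, 1, -1) <= 0)
                        continue;

                /* With a nonblocking uffd, several woken readers may race
                 * here and the losers just see a short read / EAGAIN. */
                if (read(uffd, &msg, sizeof(msg)) != (ssize_t)sizeof(msg))
                        continue;
                if (msg.event != UFFD_EVENT_PAGEFAULT)
                        continue;

                uint64_t addr = msg.arg.pagefault.address & ~((uint64_t)PAGE_SZ - 1);
                fetch_page(addr, buf);

                struct uffdio_copy copy = {
                        .dst = addr,
                        .src = (uint64_t)(uintptr_t)buf,
                        .len = PAGE_SZ,
                        .mode = 0,
                };
                ioctl(uffd, UFFDIO_COPY, &copy);
        }
        return NULL;
}

In this layout every idle worker sits on the uffd waitqueue, so with
wake-all semantics each queued message kicks all of them and all but one
go straight back to sleep; with the exclusive waitqueue only one should
wake, which is where I'd expect the win to show up once the reader count
gets large enough. That's the tipping point I'd like to measure.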