On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote:
> On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@xxxxxxxxxx> wrote:
> >
> > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote:
> > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> > >
> > > What I wanted to do is to understand whether there's still chance to
> > > provide a generic solution. I don't know why you have had a bunch of pmu
> > > stack showing in the graph, perhaps you forgot to disable some of the perf
> > > events when doing the test? Let me know if you figure out why it happened
> > > like that (so far I didn't see), but I feel guilty to keep overloading you
> > > with such questions.
> > >
> > > The major problem I had with this series is it's definitely not a clean
> > > approach. Say, even if you'll all rely on userapp you'll still need to
> > > rely on userfaultfd for kernel traps on corner cases or it just won't work.
> > > IIUC that's also the concern from Nadav.
> >
> > This is a long thread, so apologies if the following has already been discussed.
> >
> > Would per-tid userfaultfd support be a generic solution? i.e. Allow
> > userspace to create a userfaultfd that is tied to a specific task. Any
> > userfaults encountered by that task use that fd, rather than the
> > process-wide fd. I'm making the assumption here that each of these fds
> > would have independent signaling mechanisms/queues and so this would
> > solve the scaling problem.
> >
> > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per
> > vCPU for handling userfault requests. This seems like it'd have
> > roughly the same scalability characteristics as the KVM -EFAULT
> > approach.
>
> I think this would work in principle, but it's significantly different
> from what exists today.
>
> The splitting of userfaultfds Peter is describing is splitting up the
> HVA address space, not splitting per-thread.
>
> I think for this design, we'd need to change UFFD registration so
> multiple UFFDs can register the same VMA, but can be filtered so they
> only receive fault events caused by some particular tid(s).
>
> This might also incur some (small?) overhead, because in the fault
> path we now need to maintain some data structure so we can lookup
> which UFFD to notify based on a combination of the address and our
> tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial.

I was (perhaps naively) assuming the lookup would be as simple as:

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 44d1ee429eb0..e9856e2ba9ef 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
         */
        mmap_assert_locked(mm);

-       ctx = vma->vm_userfaultfd_ctx.ctx;
+       if (current->userfaultfd_ctx)
+               ctx = current->userfaultfd_ctx;
+       else
+               ctx = vma->vm_userfaultfd_ctx.ctx;
        if (!ctx)
                goto out;

>
> I think it's worth keeping in mind that a selling point of Anish's
> approach is that it's a very small change. It's plausible we can come
> up with some alternative way to scale, but it seems to me everything
> suggested so far is likely to require a lot more code, complexity, and
> effort vs. Anish's approach.

Agreed. Mostly I think the per-thread UFFD approach would add complexity
on the userspace side of things. With Anish's approach userspace is able
to trivially re-use the vCPU thread (and its associated pCPU if pinned)
to handle the request. That gets more complicated when juggling the
extra paired threads.
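Roughly, something like the following sketch (illustration only: the
-EFAULT return, KVM_EXIT_MEMORY_FAULT, and the run->memory_fault fields
are taken from Anish's proposal and may not match the final UAPI;
demand_fetch_gpa() and handle_other_exit() are made-up VMM helpers):

#include <err.h>
#include <errno.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Made-up VMM helpers, just for illustration. */
extern void demand_fetch_gpa(__u64 gpa, __u64 size);
extern void handle_other_exit(struct kvm_run *run);

static void vcpu_run_loop(int vcpu_fd, struct kvm_run *run)
{
        for (;;) {
                int ret = ioctl(vcpu_fd, KVM_RUN, 0);

                if (ret < 0 && errno != EFAULT)
                        err(1, "KVM_RUN");

                if (run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
                        /*
                         * Resolve the fault right here, on the vCPU
                         * thread (and its pinned pCPU), then re-enter
                         * the guest -- no handoff to a separate
                         * fault-handling thread.
                         */
                        demand_fetch_gpa(run->memory_fault.gpa,
                                         run->memory_fault.size);
                        continue;
                }

                handle_other_exit(run);
        }
}

How the VMM actually populates the page is out of scope here; the point
is just that no second-thread wakeup sits on the critical path.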
The per-thread approach would require a new userfaultfd UAPI change,
which I think is a higher bar than the KVM UAPI change proposed here.

The per-thread approach would also require KVM to call into slow GUP and
take the mmap_lock before contacting userspace. I'm not 100% convinced
that's a bad thing long term (e.g. it avoids the false-positive -EFAULT
exits in Anish's proposal), but it could have performance implications.

Lastly, inter-thread communication is likely slower than returning to
userspace from KVM_RUN, so the per-thread approach might increase the
end-to-end latency of demand fetches.
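For comparison, here is a minimal sketch of the paired handler thread
the per-vCPU userfaultfd approach implies on the userspace side
(assuming such a per-vCPU uffd existed; the read()/UFFDIO_COPY mechanics
below are today's uffd UAPI, fetch_page_into() is a made-up helper, and
a 4KiB page size is assumed). The faulting vCPU thread blocks in the
kernel until this thread has read the event and completed UFFDIO_COPY,
which is exactly the extra wakeup/handoff referred to above:

#include <linux/userfaultfd.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

#define PAGE_SIZE 4096UL        /* assumed 4KiB pages */

/* Made-up helper: fills "dst" with the contents for host VA "hva". */
extern void fetch_page_into(char *dst, unsigned long hva);

/* One of these threads would be paired with each vCPU's userfaultfd. */
static void *uffd_handler_thread(void *arg)
{
        int uffd = (int)(long)arg;
        char *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (page == MAP_FAILED)
                return NULL;

        for (;;) {
                struct uffd_msg msg;
                struct uffdio_copy copy;

                /* Blocks until the paired vCPU thread hits a fault. */
                if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                        continue;
                if (msg.event != UFFD_EVENT_PAGEFAULT)
                        continue;

                fetch_page_into(page, msg.arg.pagefault.address);

                copy.dst = msg.arg.pagefault.address & ~(PAGE_SIZE - 1);
                copy.src = (unsigned long)page;
                copy.len = PAGE_SIZE;
                copy.mode = 0;
                /* Wakes the vCPU thread blocked in handle_userfault(). */
                ioctl(uffd, UFFDIO_COPY, &copy);
        }
        return NULL;
}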