On Thu, May 11, 2023 at 12:45:32PM -0700, Axel Rasmussen wrote: > On Thu, May 11, 2023 at 12:05 PM David Matlack <dmatlack@xxxxxxxxxx> wrote: > > > > On Thu, May 11, 2023 at 10:33:24AM -0700, Axel Rasmussen wrote: > > > On Thu, May 11, 2023 at 10:18 AM David Matlack <dmatlack@xxxxxxxxxx> wrote: > > > > > > > > On Wed, May 10, 2023 at 2:50 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > On Tue, May 09, 2023 at 01:52:05PM -0700, Anish Moorthy wrote: > > > > > > On Sun, May 7, 2023 at 6:23 PM Peter Xu <peterx@xxxxxxxxxx> wrote: > > > > > > > > > > What I wanted to do is to understand whether there's still chance to > > > > > provide a generic solution. I don't know why you have had a bunch of pmu > > > > > stack showing in the graph, perhaps you forgot to disable some of the perf > > > > > events when doing the test? Let me know if you figure out why it happened > > > > > like that (so far I didn't see), but I feel guilty to keep overloading you > > > > > with such questions. > > > > > > > > > > The major problem I had with this series is it's definitely not a clean > > > > > approach. Say, even if you'll all rely on userapp you'll still need to > > > > > rely on userfaultfd for kernel traps on corner cases or it just won't work. > > > > > IIUC that's also the concern from Nadav. > > > > > > > > This is a long thread, so apologies if the following has already been discussed. > > > > > > > > Would per-tid userfaultfd support be a generic solution? i.e. Allow > > > > userspace to create a userfaultfd that is tied to a specific task. Any > > > > userfaults encountered by that task use that fd, rather than the > > > > process-wide fd. I'm making the assumption here that each of these fds > > > > would have independent signaling mechanisms/queues and so this would > > > > solve the scaling problem. > > > > > > > > A VMM could use this to create 1 userfaultfd per vCPU and 1 thread per > > > > vCPU for handling userfault requests. This seems like it'd have > > > > roughly the same scalability characteristics as the KVM -EFAULT > > > > approach. > > > > > > I think this would work in principle, but it's significantly different > > > from what exists today. > > > > > > The splitting of userfaultfds Peter is describing is splitting up the > > > HVA address space, not splitting per-thread. > > > > > > I think for this design, we'd need to change UFFD registration so > > > multiple UFFDs can register the same VMA, but can be filtered so they > > > only receive fault events caused by some particular tid(s). > > > > > > This might also incur some (small?) overhead, because in the fault > > > path we now need to maintain some data structure so we can lookup > > > which UFFD to notify based on a combination of the address and our > > > tid. Today, since VMAs and UFFDs are 1:1 this lookup is trivial. > > > > I was (perhaps naively) assuming the lookup would be as simple as: > > > > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c > > index 44d1ee429eb0..e9856e2ba9ef 100644 > > --- a/fs/userfaultfd.c > > +++ b/fs/userfaultfd.c > > @@ -417,7 +417,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason) > > */ > > mmap_assert_locked(mm); > > > > - ctx = vma->vm_userfaultfd_ctx.ctx; > > + if (current->userfaultfd_ctx) > > + ctx = current->userfaultfd_ctx; > > + else > > + ctx = vma->vm_userfaultfd_ctx.ctx; > > if (!ctx) > > goto out; This is an interesting idea, but I'll just double check before I grow task_struct and see whether that's the only solution. :) I'd start with hash(tid) or even hash(pcpu) to choose queue. In a pinned use case hash(pcpu) should probably reach a similar goal here (and I'd guess hash(tid) too if vcpus are mostly always created in one shot, just slightly trickier). > > Hmm, perhaps. It might have to be more complicated if we want to allow > a single task to have both per-TID UFFDs for some addresses, and > "global" UFFDs for others. > > > > Actually, while thinking about this, another wrinkle: > > Imagine we have per-thread UFFDs. Thread X faults on some address, and > goes to sleep waiting for its paired resolver thread to resolve the > fault. > > In the meantime, thread Y also faults on the same address, before the > resolution happens. > > In the existing model, there is a single UFFD context per VMA, and > therefore a single wait queue for all threads to wait on. In the > per-TID-UFFD design, now each thread has its own context, and > ostensibly its own wait queue (since the wait queue locks are where > Anish saw the contention, I think this is exactly what we want to > split up). When we have this "multiple threads waiting on the same > address" situation, how do we ensure the fault is resolved exactly > once? And how do we wake up all of the sleeping threads when it is > resolved? We probably need to wake them one by one in that case. The 2nd-Nth UFFDIO_COPY/CONTINUE will fail with -EEXIST anyway, then the userapp will need a UFFDIO_WAKE I assume. > > I'm sure it's solvable, but especially doing it without any locks / > contention seems like it could be a bit complicated. IMHO no complicated locking is needed. Here the "complicated locking" is done with pgtable lock and it should be reflected by -EEXISTs to userapp. > > > > > > > > > I think it's worth keeping in mind that a selling point of Anish's > > > approach is that it's a very small change. It's plausible we can come > > > up with some alternative way to scale, but it seems to me everything > > > suggested so far is likely to require a lot more code, complexity, and > > > effort vs. Anish's approach. > > > > Agreed. > > > > Mostly I think the per-thread UFFD approach would add complexity on the > > userspace side of things. With Anish's approach userspace is able to > > trivially re-use the vCPU thread (and it's associated pCPU if pinned) to > > handle the request. That gets more complicated when juggling the extra > > paired threads. > > > > The per-thread approach would requires a new userfault UAPI change which > > I think is a higher bar than the KVM UAPI change proposed here. > > > > The per-thread approach would require KVM call into slow GUP and take > > the mmap_lock before contacting userspace. I'm not 100% convinced that's > > a bad thing long term (e.g. it avoids the false-positive -EFAULT exits > > in Anish's proposal), but could have performance implications. > > > > Lastly, inter-thread communication is likely slower than returning to > > userspace from KVM_RUN. So the per-thread approach might increase the > > end-to-end latency of demand fetches. Right. The overhead here is (IMHO): - KVM solution: vcpu exit -> enter again, whatever happens in the procedure of exit/enter will count. - mm solution: at least schedule overhead, meanwhile let's hope we can scale first elsewhere (and I'm not sure if there'll be other issues, e.g., even if we can split the uffd queues, hopefully totally nowhere I overlooked that still need the shared uffd context) I'm not sure which one will be higher or maybe it depends (e.g., some specific cases where vcpu KVM_RUN can have higher overhead when loading?). Thanks, -- Peter Xu