On 10/12/19 16:52, Peter Xu wrote:
> On Tue, Dec 10, 2019 at 11:07:31AM +0100, Paolo Bonzini wrote:
>>> I'm thinking whether I can start to use this information in the next
>>> post on solving an issue I encountered with the waitqueue.
>>>
>>> The current waitqueue is still problematic in that it could wait even
>>> with the mmu lock held when in vcpu context.
>>
>> I think the idea of the soft limit is that the waiting just cannot
>> happen.  That is, the number of dirtied pages _outside_ the guest
>> (guest accesses are taken care of by PML, and are subtracted from the
>> soft limit) cannot exceed hard_limit - (soft_limit + pml_size).
>
> So the question goes back to: is this guaranteed somehow?  Or do you
> prefer us to keep the warn_on_once until it triggers, and then we can
> analyze it (which I doubt..)?

Yes, I would like to keep the WARN_ON_ONCE, just because you never know.
Of course it would be much better to audit the calls to kvm_write_guest
and figure out how many could trigger (e.g. two from the operands of an
emulated instruction, 5 from a nested EPT walk, 1 from a page walk,
etc.).

> One thing to mention is that for the with-vcpu cases, we can probably
> even stop KVM_RUN immediately as soon as either the per-vm or per-vcpu
> ring reaches the soft limit, so for the vcpu case it should be easier
> to guarantee.  What I want to know about is the rest of the cases,
> like ioctls or even something not from userspace (which I think I
> should read more about later..).

Which ioctls?  Most ioctls shouldn't dirty memory at all.

>>> And if we see that mark_page_dirty_in_slot() is called not with a
>>> vcpu context (e.g. kvm_mmu_page_fault) but with an ioctl context (in
>>> those cases we'll use the per-vm dirty ring), then it's probably
>>> fine.
>>>
>>> My planned solution:
>>>
>>> - When kvm_get_running_vcpu() != NULL, we postpone the waitqueue
>>>   waits until we finish handling this page fault, probably somewhere
>>>   around vcpu_enter_guest, so that we can do wait_event() after the
>>>   mmu lock is released
>>
>> I think this can cause a race:
>>
>>    vCPU 1                  vCPU 2                  host
>>    ---------------------------------------------------------------
>>    mark page dirty
>>                            write to page
>>                                                    treat page as not dirty
>>    add page to ring
>>
>> where vCPU 2 skips the clean-page slow path entirely.
>
> If we're still with the rule in userspace that we first do RESET and
> then collect and send the pages (just like what we've discussed
> before), then IMHO it's fine for vCPU 2 to skip the slow path?
> Because RESET happens at "treat page as not dirty", so if we are sure
> that we only collect and send pages after that point, then the latest
> "write to page" data from vCPU 2 won't be lost even if vCPU 2 is not
> blocked by vCPU 1's ring being full.

Good point, the race would become

   vCPU 1                  vCPU 2                  host
   ---------------------------------------------------------------
   mark page dirty
                           write to page
                                                   reset rings
                                                     wait for mmu lock
   add page to ring
   release mmu lock
                                                     ...do reset...
                                                     release mmu lock
                           page is now dirty

> Maybe we can also consider letting mark_page_dirty_in_slot() return a
> value, so that the upper layer has a chance to skip the spte update if
> mark_page_dirty_in_slot() fails to mark the dirty bit, and it can
> return directly with RET_PF_RETRY.

I don't think that's possible: most writes won't come from a page fault
path and cannot retry.

Paolo
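
To make the budget arithmetic above concrete, here is a small
stand-alone C model (not the KVM code from the series; every struct
field and function name here is invented for illustration): once the
ring crosses the soft limit the vCPU exits to userspace, so writes from
outside guest context only ever need hard_limit - (soft_limit +
pml_size) slots, with pml_size kept in reserve for a PML flush.

/*
 * Illustrative sketch only -- not KVM code.  Models the budget Paolo
 * describes: after soft_limit is crossed, non-guest writes (emulated
 * instruction operands, nested EPT walks, page walks, ...) must fit in
 *   hard_limit - (soft_limit + pml_size)
 * slots, with pml_size held back for a guest PML flush.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct dirty_ring {
	unsigned int hard_limit;  /* total entries in the ring            */
	unsigned int soft_limit;  /* exit-to-userspace threshold          */
	unsigned int pml_size;    /* worst-case entries from a PML flush  */
	unsigned int used;        /* entries currently queued             */
};

/* Slots that non-guest writes may consume after soft_limit is crossed. */
static unsigned int non_guest_budget(const struct dirty_ring *r)
{
	assert(r->hard_limit >= r->soft_limit + r->pml_size);
	return r->hard_limit - (r->soft_limit + r->pml_size);
}

/*
 * Push one dirty GFN from outside guest context.  Returns false when
 * the reserved headroom is gone -- the situation a WARN_ON_ONCE would
 * flag, because then the only options left are dropping the entry or
 * sleeping, possibly with the mmu lock held.
 */
static bool push_non_guest_dirty(struct dirty_ring *r)
{
	if (r->used >= r->hard_limit - r->pml_size)
		return false;
	r->used++;
	return true;
}

int main(void)
{
	struct dirty_ring r = {
		.hard_limit = 4096, .soft_limit = 3584, .pml_size = 256,
		.used = 3584,  /* just crossed the soft limit */
	};
	unsigned int pushed = 0;

	while (push_non_guest_dirty(&r))
		pushed++;

	/* With these numbers: pushed == non_guest_budget(&r) == 256. */
	printf("absorbed %u non-guest writes, budget %u\n",
	       pushed, non_guest_budget(&r));
	return 0;
}

With the example numbers the model absorbs exactly 256 non-guest
writes, i.e. exactly the reserved budget, which is the property that
would make "the waiting just cannot happen" hold.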
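
The "postpone the wait" idea can be sketched the same way (again, all
names below are invented; in the real code the pieces would be
mark_page_dirty_in_slot(), kvm_get_running_vcpu() and the path around
vcpu_enter_guest()): when the ring fills while the fault path holds the
mmu lock, only a flag is recorded, and the actual sleep happens after
the lock is dropped, just before re-entering the guest.

/*
 * Illustrative sketch only -- not KVM code.  Models deferring the
 * ring-full wait out of the locked fault path and into the point just
 * before guest re-entry.
 */
#include <stdbool.h>
#include <stdio.h>

struct vcpu {
	bool ring_soft_full;  /* deferred "need to wait" condition        */
	bool mmu_lock_held;   /* stand-in for spin_lock(&kvm->mmu_lock)   */
};

/* Called from the page-fault path, with the (model) mmu lock held. */
static void mark_page_dirty_in_slot_model(struct vcpu *v, bool ring_full)
{
	if (ring_full) {
		/* Do NOT sleep here: the mmu lock is held. */
		v->ring_soft_full = true;
		printf("ring full noted under mmu lock, wait deferred\n");
	}
	printf("dirty gfn pushed to ring\n");
}

static void handle_page_fault(struct vcpu *v, bool ring_full)
{
	v->mmu_lock_held = true;               /* "spin_lock(&mmu_lock)"   */
	mark_page_dirty_in_slot_model(v, ring_full);
	v->mmu_lock_held = false;              /* "spin_unlock(&mmu_lock)" */
}

/* Just before re-entering the guest: a safe place to sleep. */
static void vcpu_enter_guest_model(struct vcpu *v)
{
	if (v->ring_soft_full) {
		/* In real code this would be a wait_event() until
		 * userspace has harvested and reset the ring. */
		printf("waiting for ring reset (mmu lock held: %d)\n",
		       v->mmu_lock_held);
		v->ring_soft_full = false;
	}
	printf("entering guest\n");
}

int main(void)
{
	struct vcpu v = { 0 };

	handle_page_fault(&v, /* ring_full = */ true);
	vcpu_enter_guest_model(&v);
	return 0;
}

The point of the model is only the ordering: the sleep never happens on
a path that holds the mmu lock, which is exactly the property the race
discussion above is probing.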