On Fri, Jul 14, 2017 at 05:40:48PM +0100, Suzuki K Poulose wrote:
> On 06/07/17 10:42, Christoffer Dall wrote:
> >On Thu, Jul 06, 2017 at 10:34:58AM +0100, Suzuki K Poulose wrote:
> >>On 06/07/17 08:45, Christoffer Dall wrote:
> >>>On Thu, Jul 06, 2017 at 09:07:49AM +0200, Alexander Graf wrote:
> >>>>
> >>>>On 05.07.17 10:57, Suzuki K Poulose wrote:
> >>>>>Hi Alex,
> >>>>>
> >>>>>On Wed, Jul 05, 2017 at 08:20:31AM +0200, Alexander Graf wrote:
> >>>>>>The kvm_age_hva callback may run concurrently with
> >>>>>>kvm_mmu_notifier_release().
> >>>>>>
> >>>>>>The release function sets kvm->arch.pgd = NULL, which the aging
> >>>>>>function implicitly relies on in stage2_get_pud(). That means the
> >>>>>>two can race and the aging function may dereference a NULL pgd
> >>>>>>pointer.
> >>>>>>
> >>>>>>This patch adds a check for that case, so that we leave the aging
> >>>>>>function silently.
> >>>>>>
> >>>>>>Cc: stable@xxxxxxxxxxxxxxx
> >>>>>>Fixes: 293f29363 ("kvm-arm: Unmap shadow pagetables properly")
> >>>>>>Signed-off-by: Alexander Graf <agraf@xxxxxxx>
> >>>>>>
> >>>>>>---
> >>>>>>
> >>>>>>v1 -> v2:
> >>>>>>
> >>>>>>  - Fix commit message
> >>>>>>  - Add Fixes and stable tags
> >>>>>>---
> >>>>>> virt/kvm/arm/mmu.c | 4 ++++
> >>>>>> 1 file changed, 4 insertions(+)
> >>>>>>
> >>>>>>diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >>>>>>index f2d5b6c..227931f 100644
> >>>>>>--- a/virt/kvm/arm/mmu.c
> >>>>>>+++ b/virt/kvm/arm/mmu.c
> >>>>>>@@ -861,6 +861,10 @@ static pud_t *stage2_get_pud(struct kvm *kvm, struct kvm_mmu_memory_cache *cache
> >>>>>> 	pgd_t *pgd;
> >>>>>> 	pud_t *pud;
> >>>>>>
> >>>>>>+	/* Do we clash with kvm_free_stage2_pgd()? */
> >>>>>>+	if (!kvm->arch.pgd)
> >>>>>>+		return NULL;
> >>>>>>+
> >>>>>
> >>>>>I think this check should be moved up in the chain. We call kvm_age_hva()
> >>>>>with the kvm->mmu_lock held and we don't release it until we reach here.
> >>>>>So, ideally, if we find that the PGD is NULL when we reach kvm_age_hva(),
> >>>>>we could simply return there, like we do for the other callbacks from the
> >>>>>KVM mmu_notifier.
> >>>>
> >>>>That probably works too - I'm not sure which version is more
> >>>>consistent as well as more maintainable in the long run. I'll leave
> >>>>the call here to Christoffer.
> >>>>
> >>>
> >>>Let's look at the callers of stage2_get_pmd, which is the only caller of
> >>>stage2_get_pud, where the problem was observed:
> >>>
> >>>  user_mem_abort
> >>>    -> stage2_set_pmd_huge
> >>>      -> stage2_get_pmd
> >>>
> >>>  user_mem_abort
> >>>    -> stage2_set_pte
> >>>      -> stage2_get_pmd
> >>>
> >>>  handle_access_fault
> >>>    -> stage2_get_pmd
> >>>
> >>>For the above three functions, pgd cannot ever be NULL, because this is
> >>>running in the context of a VCPU thread, which means the reference on
> >>>the VM fd cannot reach zero, so there is no need for the check here.
> >>
> >>I think there is some problem here. See below for more information.
> >>
> >>>
> >>>  kvm_set_spte_handler
> >>>    -> stage2_set_pte
> >>>      -> stage2_get_pmd
> >>>
> >>>This is called from kvm_set_spte_hva, which is one of the MMU notifiers,
> >>>so it can race similarly to kvm_age_hva and kvm_test_age_hva, but it
> >>>already checks for !kvm->arch.pgd.
> >>>
> >>>  kvm_phys_addr_ioremap
> >>>    -> stage2_set_pte
> >>>      -> stage2_get_pmd
> >>>
> >>>This is called from two places: (1) the VGIC code (as part of
> >>>vgic_v2_map_resources), which can only run in the context of a running
> >>>VCPU, so the pgd cannot be NULL by virtue of the same argument as for
> >>>user_mem_abort; and (2) kvm_arch_prepare_memory_region, which is a VM
> >>>ioctl, so similarly I cannot see how the VM can be in the middle of
> >>>being freed while handling ioctls on the fd. Therefore, following the
> >>>same argument, this should be safe as well.
> >>>
> >>>  kvm_age_hva_handler and kvm_test_age_hva_handler
> >>>    -> stage2_get_pmd
> >>>
> >>>Handled by the patch proposed by Suzuki.
> >>>
> >>>What does all that tell us? First, it should give us some comfort that we
> >>>don't have more races of this kind. Second, it tells us that there are
> >>>a number of different and not-obvious call paths to stage2_get_pud,
> >>>which could be an argument to simply check the pgd whenever it's called,
> >>>despite the minor runtime overhead. On the other hand, the check itself
> >>>is only valid knowing that we synchronize against kvm_free_stage2_pgd
> >>>using the kvm->mmu_lock, and understanding that this only happens when
> >>>mmu notifiers call into the KVM MMU code outside the context of the VM.
> >>>
> >>>The last consideration is the winning argument for me to put the check
> >>>in kvm_age_hva_handler and kvm_test_age_hva_handler, but I think it's
> >>>important that we document why it's only these three high-level callers
> >>>(incl. kvm_set_spte_handler) that need the check, either in the code or
> >>>in the commit message.
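
(For concreteness, the alternative being settled on above would look
roughly like the sketch below. The handler signature is paraphrased
rather than copied from the tree, so treat the exact parameters - and
the sketch as a whole - as an illustration of the idea, not the actual
patch.)

static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa, void *data)
{
	/*
	 * The mmu_notifier path calls us with kvm->mmu_lock held, and
	 * kvm_free_stage2_pgd() clears kvm->arch.pgd under the same
	 * lock, so if the pgd is already gone we can simply report
	 * "not young" instead of walking a table being torn down.
	 */
	if (!kvm->arch.pgd)
		return 0;

	/* ... existing pmd/pte young-bit test continues unchanged ... */
	return 0;
}

/* kvm_test_age_hva_handler() would gain the same early return. */
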
> >>
> >>The only way we end up freeing the stage-2 PGD is via mmu_notifier_release(),
> >>which can be triggered via two different paths:
> >>
> >>1) kvm_destroy_vm(), where all the VM resources have been released and the
> >>refcount on the KVM instance is dropped, via kvm_put_kvm():
> >>
> >>kvm_put_kvm()
> >>  kvm_destroy_vm()
> >>    mmu_notifier_unregister()
> >>      mmu_notifier_ops->release()
> >>        kvm_arch_flush_shadow_all()
> >>          kvm_free_stage2_pgd() -> frees the page table with the mmu_lock
> >>                                   held, occasionally releasing it to
> >>                                   avoid contention
> >>
> >>or
> >>
> >>2) do_signal -> get_signal -> do_group_exit ->
> >>     do_exit
> >>       exit_mm
> >>         mmput => __mmput
> >>           exit_mmap
> >>             mmu_notifier_release
> >>               mmu_notifier_ops->release
> >>                 kvm_arch_flush_shadow_all
> >>                   kvm_free_stage2_pgd
> >>
> >>In the first case, all references to the VM have been dropped and hence
> >>none of the VCPUs can still be executing.
> >>
> >>However, in the second case that may not be true. So we have a potential
> >>problem with a VCPU trying to run even when the pages were unmapped. I
> >>think the root cause of all these issues boils down to the assumption that
> >>KVM holds a reference to the mm, which is not necessarily a reference to
> >>the user space mappings (i.e. mmgrab vs mmget). I am not sure whether the
> >>VCPU should hold a reference to the mmaps to make sure it is safe to run.
> >>That way, the mmap stays around until the VCPU eventually exits and we are
> >>safe all the way around.
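
(Side note on the mmgrab vs mmget distinction, since it is the crux of
the exchange that follows: mmgrab()/mmdrop() only pin struct mm_struct
itself via mm_count, while mmget()/mmput() pin the address space via
mm_users, and exit_mmap() only runs once mm_users drops to zero. A
hypothetical way for code outside a VCPU thread to keep the mappings
alive would therefore be something like the sketch below; the helper
name is made up and this is purely illustrative, not what KVM does.)

/* Illustrative only: pin the userspace mappings around a walk. */
static int walk_with_pinned_mm(struct kvm *kvm)
{
	/* Fails if the last user of the address space is already gone. */
	if (!mmget_not_zero(kvm->mm))
		return -ESRCH;

	/*
	 * exit_mmap(), and hence kvm_arch_flush_shadow_all(), cannot run
	 * until the matching mmput() below.
	 */

	/* ... touch the mappings / stage-2 tables here ... */

	mmput(kvm->mm);
	return 0;
}
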
> >
> >Hmmm, my assumption is that if a VCPU is running, it means there is a
> >VCPU thread that shares the struct mm which is running, so I don't
> >understand how mmput would be able to call exit_mmap in the scenario
> >above?
> >
> >So the distinction here is that I don't assume that the VCPU fd holds a
> >reference to the mm, but I assume that the (running) VCPU thread does.
> >Is this incorrect?
>
> Sorry, I lost this thread in between.

No worries.

>
> Hmm. You're right. The VCPU thread should hold a refcount on the mmap,
> and it shouldn't do anything with the mmu if it has dropped it. I was
> confused by an old bug report (see the description of commit
> 293f293637b55d, "kvm-arm: Unmap shadow pagetables properly"), which has
> since been fixed.
>

OK, that makes sense to me.  I hope Andrea also agrees with this, so as
long as a VCPU thread is running, exit_mm() will not be called.

Thanks,
-Christoffer