On Fri, Nov 02, 2018 at 10:26:37AM +0000, Suzuki K Poulose wrote:
> Hi Christoffer,
> 
> On 02/11/18 07:53, Christoffer Dall wrote:
> >There are two things we need to take care of when we create block
> >mappings in the stage 2 page tables:
> >
> >  (1) The alignment within a PMD between the host address range and the
> >  guest IPA range must be the same, since otherwise we end up mapping
> >  pages with the wrong offset.
> >
> >  (2) The head and tail of a memory slot may not cover a full block
> >  size, and we have to take care to not map those with block
> >  descriptors, since we could expose memory to the guest that the host
> >  did not intend to expose.
> >
> >So far, we have been taking care of (1), but not (2), and our commentary
> >describing (1) was somewhat confusing.
> >
> >This commit attempts to factor out the checks of both into a common
> >function, and if we don't pass the check, we won't attempt any PMD
> >mappings for either hugetlbfs or THP.
> >
> >Note that we used to only check the alignment for THP, not for
> >hugetlbfs, but as far as I can tell the check needs to be applied to
> >both scenarios.
> >
> >Cc: Ralph Palutke <ralph.palutke@xxxxxx>
> >Cc: Lukas Braun <koomi@xxxxxxxxxxx>
> >Reported-by: Lukas Braun <koomi@xxxxxxxxxxx>
> >Signed-off-by: Christoffer Dall <christoffer.dall@xxxxxxx>
> >---
> >  virt/kvm/arm/mmu.c | 79 +++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 64 insertions(+), 15 deletions(-)
> >
> >diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
> >index 5eca48bdb1a6..4e7572656b5c 100644
> >--- a/virt/kvm/arm/mmu.c
> >+++ b/virt/kvm/arm/mmu.c
> >@@ -1470,6 +1470,63 @@ static void kvm_send_hwpoison_signal(unsigned long address,
> >          send_sig_mceerr(BUS_MCEERR_AR, (void __user *)address, lsb, current);
> >  }
> >+static bool fault_supports_stage2_pmd_mappings(struct kvm_memory_slot *memslot,
> >+                                               unsigned long hva)
> >+{
> >+        gpa_t gpa_start, gpa_end;
> >+        hva_t uaddr_start, uaddr_end;
> >+        size_t size;
> >+
> >+        size = memslot->npages * PAGE_SIZE;
> >+
> >+        gpa_start = memslot->base_gfn << PAGE_SHIFT;
> >+        gpa_end = gpa_start + size;
> >+
> >+        uaddr_start = memslot->userspace_addr;
> >+        uaddr_end = uaddr_start + size;
> >+
> >+        /*
> >+         * Pages belonging to memslots that don't have the same alignment
> >+         * within a PMD for userspace and IPA cannot be mapped with stage-2
> >+         * PMD entries, because we'll end up mapping the wrong pages.
> >+         *
> >+         * Consider a layout like the following:
> >+         *
> >+         *    memslot->userspace_addr:
> >+         *    +-----+--------------------+--------------------+---+
> >+         *    |abcde|fgh  Stage-1 PMD    |    Stage-1 PMD   tv|xyz|
> >+         *    +-----+--------------------+--------------------+---+
> >+         *
> >+         *    memslot->base_gfn << PAGE_SIZE:
> >+         *      +---+--------------------+--------------------+-----+
> >+         *      |abc|def  Stage-2 PMD    |    Stage-2 PMD     |tvxyz|
> >+         *      +---+--------------------+--------------------+-----+
> >+         *
> >+         * If we create those stage-2 PMDs, we'll end up with this incorrect
> >+         * mapping:
> >+         *   d -> f
> >+         *   e -> g
> >+         *   f -> h
> >+         */
> 
> Nice ! The comment makes it so easy to understand the problem.
> 
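To make the diagram above concrete, here is a small standalone userspace
sketch (explicitly not kernel code; the 2M stage-2 PMD size, 4K pages, and
the base addresses are made up for illustration). It reproduces the slot
layout from the comment, shows that the alignment check flags it, and shows
how a block mapping installed on a fault on page 'f' would leave guest page
'd' backed by host page 'f':

  /* pmd_align_sketch.c -- illustration only, not kernel code */
  #include <stdio.h>
  #include <stdint.h>

  #define PAGE_SHIFT   12
  #define PAGE_SIZE    (1UL << PAGE_SHIFT)
  #define S2_PMD_SHIFT 21                     /* assumes 2M stage-2 PMDs */
  #define S2_PMD_SIZE  (1UL << S2_PMD_SHIFT)
  #define S2_PMD_MASK  (~(S2_PMD_SIZE - 1))

  int main(void)
  {
      /*
       * Slot page 0 is 'a'.  The slot starts 5 pages before a stage-1 PMD
       * boundary on the host side, but only 3 pages before a stage-2 PMD
       * boundary on the guest side (the layout from the diagram).
       */
      uint64_t uaddr_start = 0x40000000UL - 5 * PAGE_SIZE;
      uint64_t gpa_start   = 0x80200000UL - 3 * PAGE_SIZE;

      /* the alignment check from fault_supports_stage2_pmd_mappings() */
      if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK))
          printf("slot is misaligned within a PMD, force PTE mappings\n");

      /*
       * What a stage-2 block would do: a fault on guest page 'f' maps the
       * block containing its IPA to the huge page covering the PMD-aligned
       * fault hva, so every page in that block inherits the wrong offset.
       */
      uint64_t fault_hva = uaddr_start + 5 * PAGE_SIZE;  /* host page 'f' */
      uint64_t guest_d   = gpa_start + 3 * PAGE_SIZE;    /* guest page 'd' */
      uint64_t backing   = (fault_hva & S2_PMD_MASK) + (guest_d & ~S2_PMD_MASK);
      uint64_t correct   = uaddr_start + 3 * PAGE_SIZE;  /* host page 'd' */

      printf("guest page 'd' should be backed by host page '%c', the block "
             "mapping would use host page '%c'\n",
             (char)('a' + (correct - uaddr_start) / PAGE_SIZE),
             (char)('a' + (backing - uaddr_start) / PAGE_SIZE));
      return 0;
  }

Running it prints the misalignment warning and reports host page 'f' where
host page 'd' was expected, which is exactly the d -> f case from the
comment.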
> >+        if ((gpa_start & ~S2_PMD_MASK) != (uaddr_start & ~S2_PMD_MASK))
> >+                return false;
> >+
> >+        /*
> >+         * Next, let's make sure we're not trying to map anything not covered
> >+         * by the memslot. This means we have to prohibit PMD size mappings
> >+         * for the beginning and end of a non-PMD aligned and non-PMD sized
> >+         * memory slot (illustrated by the head and tail parts of the
> >+         * userspace view above containing pages 'abcde' and 'xyz',
> >+         * respectively).
> >+         *
> >+         * Note that it doesn't matter if we do the check using the
> >+         * userspace_addr or the base_gfn, as both are equally aligned (per
> >+         * the check above) and equally sized.
> >+         */
> >+        return (hva & S2_PMD_MASK) >= uaddr_start &&
> >+               (hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end;
> 
> Shouldn't this be :
> 
>         (hva & S2_PMD_MASK) + S2_PMD_SIZE < uaddr_end;
> 
>   or
> 
>         (hva & S2_PMD_MASK) + S2_PMD_SIZE <= (uaddr_end - 1);

Let's consider an example.

Imagine a 4MB region (1024 4K pages) starting at 0, and we want to ensure
that the address is within one of the 2MB pages.

That means, size will be:

  size = 1024 * 4096 = 4194304

  uaddr_start = 0
  uaddr_end = 0 + 4194304 = 4194304

and the comparison check for anything in the first 2MB region becomes

  return 0 >= 0 && 2097152 <= 4194304;

which is what we want.

The comparison check for anything in the second 2MB region becomes

  return 2097152 >= 0 && 4194304 <= 4194304;

which is also what we want.

If we had done a strictly less than, or subtracted one from uaddr_end,
this check would fail, which we don't want.

Am I missing something?
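For what it's worth, plugging those numbers into the return expression
confirms it; a quick standalone sketch (same 2M/4K assumptions as before,
macros defined locally):

  /* boundary_sketch.c -- illustration only, not kernel code */
  #include <stdio.h>
  #include <stdint.h>

  #define PAGE_SIZE    4096UL
  #define S2_PMD_SIZE  (2UL * 1024 * 1024)
  #define S2_PMD_MASK  (~(S2_PMD_SIZE - 1))

  int main(void)
  {
      uint64_t uaddr_start = 0;
      uint64_t uaddr_end   = uaddr_start + 1024 * PAGE_SIZE;  /* 4194304 */

      /* one hva inside the first 2MB region, one inside the second */
      uint64_t hvas[2] = { 0x1000, S2_PMD_SIZE + 0x1000 };

      for (int i = 0; i < 2; i++) {
          uint64_t hva = hvas[i];
          int le = (hva & S2_PMD_MASK) >= uaddr_start &&
                   (hva & S2_PMD_MASK) + S2_PMD_SIZE <= uaddr_end;
          int lt = (hva & S2_PMD_MASK) >= uaddr_start &&
                   (hva & S2_PMD_MASK) + S2_PMD_SIZE < uaddr_end;
          printf("hva 0x%llx: PMD mapping allowed with '<=': %d, with '<': %d\n",
                 (unsigned long long)hva, le, lt);
      }
      return 0;
  }

The '<=' version allows a PMD mapping for both 2MB regions of the slot,
while the strict comparison would wrongly reject the last one.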
> 
> >+}
> >+
> >  static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                            struct kvm_memory_slot *memslot, unsigned long hva,
> >                            unsigned long fault_status)
> >@@ -1495,6 +1552,12 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                  return -EFAULT;
> >          }
> >+        if (!fault_supports_stage2_pmd_mappings(memslot, hva))
> >+                force_pte = true;
> >+
> >+        if (logging_active)
> >+                force_pte = true;
> 
> super minor nit: If we combine the checks, maybe we could skip checking
> the alignments when logging is active.
> 
>         if (logging_active ||
>             !fault_supports_stage2_pmd_mappings(memslot, hva))
>                 force_pte = true;
> 

I prefer the way the code is written now, because logging active is a
logically very separate concept from checking the boundaries, and this
is really the slow path anyway, so I'm not sure there's any measurable
benefit.

More generally, I'd like user_mem_abort() to become more like:

  if (check1())
          condition1 = value;
  if (check2())
          condition2 = value;
  if (check3())
          condition3 = value;

  manipulate_page_tables(condition1, condition2, condition3, ...);

And I'll have a go at that following Punit's PUD huge pages.

> >+
> >          /* Let's check if we will get back a huge page backed by hugetlbfs */
> >          down_read(&current->mm->mmap_sem);
> >          vma = find_vma_intersection(current->mm, hva, hva + 1);
> >@@ -1504,22 +1567,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                  return -EFAULT;
> >          }
> >-        if (vma_kernel_pagesize(vma) == PMD_SIZE && !logging_active) {
> >+        if (vma_kernel_pagesize(vma) == PMD_SIZE && !force_pte) {
> >                  hugetlb = true;
> >                  gfn = (fault_ipa & PMD_MASK) >> PAGE_SHIFT;
> >-        } else {
> >-                /*
> >-                 * Pages belonging to memslots that don't have the same
> >-                 * alignment for userspace and IPA cannot be mapped using
> >-                 * block descriptors even if the pages belong to a THP for
> >-                 * the process, because the stage-2 block descriptor will
> >-                 * cover more than a single THP and we loose atomicity for
> >-                 * unmapping, updates, and splits of the THP or other pages
> >-                 * in the stage-2 block range.
> >-                 */
> >-                if ((memslot->userspace_addr & ~PMD_MASK) !=
> >-                    ((memslot->base_gfn << PAGE_SHIFT) & ~PMD_MASK))
> >-                        force_pte = true;
> >          }
> >          up_read(&current->mm->mmap_sem);
> >@@ -1558,7 +1608,6 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
> >                   * should not be mapped with huge pages (it introduces churn
> >                   * and performance degradation), so force a pte mapping.
> >                   */
> >-                force_pte = true;
> >                  flags |= KVM_S2_FLAG_LOGGING_ACTIVE;
> 
> Otherwise looks good to me.
> 

Thanks!

    Christoffer