Hi Yanan, On 3/3/21 11:04 AM, wangyanan (Y) wrote: > Hi Alex, > > On 2021/3/3 1:13, Alexandru Elisei wrote: >> Hello, >> >> On 2/8/21 11:22 AM, Yanan Wang wrote: >>> When KVM needs to coalesce the normal page mappings into a block mapping, >>> we currently invalidate the old table entry first followed by invalidation >>> of TLB, then unmap the page mappings, and install the block entry at last. >>> >>> It will cost a long time to unmap the numerous page mappings, which means >>> there will be a long period when the table entry can be found invalid. >>> If other vCPUs access any guest page within the block range and find the >>> table entry invalid, they will all exit from guest with a translation fault >>> which is not necessary. And KVM will make efforts to handle these faults, >>> especially when performing CMOs by block range. >>> >>> So let's quickly install the block entry at first to ensure uninterrupted >>> memory access of the other vCPUs, and then unmap the page mappings after >>> installation. This will reduce most of the time when the table entry is >>> invalid, and avoid most of the unnecessary translation faults. >> I'm not convinced I've fully understood what is going on yet, but it seems to me >> that the idea is sound. Some questions and comments below. > What I am trying to do in this patch is to adjust the order of rebuilding block > mappings from page mappings. > Take the rebuilding of 1G block mappings as an example. > Before this patch, the order is like: > 1) invalidate the table entry of the 1st level(PUD) > 2) flush TLB by VMID > 3) unmap the old PMD/PTE tables > 4) install the new block entry to the 1st level(PUD) > > So entry in the 1st level can be found invalid by other vcpus in 1), 2), and 3), > and it's a long time in 3) to unmap > the numerous old PMD/PTE tables, which means the total time of the entry being > invalid is long enough to > affect the performance. > > After this patch, the order is like: > 1) invalidate the table ebtry of the 1st level(PUD) > 2) flush TLB by VMID > 3) install the new block entry to the 1st level(PUD) > 4) unmap the old PMD/PTE tables > > The change ensures that period of entry in the 1st level(PUD) being invalid is > only in 1) and 2), > so if other vcpus access memory within 1G, there will be less chance to find the > entry invalid > and as a result trigger an unnecessary translation fault. Thank you for the explanation, that was my understand of it also, and I believe your idea is correct. I was more concerned that I got some of the details wrong, and you have kindly corrected me below. >>> Signed-off-by: Yanan Wang <wangyanan55@xxxxxxxxxx> >>> --- >>> arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++-------------- >>> 1 file changed, 12 insertions(+), 14 deletions(-) >>> >>> diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c >>> index 78a560446f80..308c36b9cd21 100644 >>> --- a/arch/arm64/kvm/hyp/pgtable.c >>> +++ b/arch/arm64/kvm/hyp/pgtable.c >>> @@ -434,6 +434,7 @@ struct stage2_map_data { >>> kvm_pte_t attr; >>> kvm_pte_t *anchor; >>> + kvm_pte_t *follow; >>> struct kvm_s2_mmu *mmu; >>> struct kvm_mmu_memory_cache *memcache; >>> @@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, >>> u32 level, >>> if (!kvm_block_mapping_supported(addr, end, data->phys, level)) >>> return 0; >>> - kvm_set_invalid_pte(ptep); >>> - >>> /* >>> - * Invalidate the whole stage-2, as we may have numerous leaf >>> - * entries below us which would otherwise need invalidating >>> - * individually. >>> + * If we need to coalesce existing table entries into a block here, >>> + * then install the block entry first and the sub-level page mappings >>> + * will be unmapped later. >>> */ >>> - kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu); >>> data->anchor = ptep; >>> + data->follow = kvm_pte_follow(*ptep); >>> + stage2_coalesce_tables_into_block(addr, level, ptep, data); >> Here's how stage2_coalesce_tables_into_block() is implemented from the previous >> patch (it might be worth merging it with this patch, I found it impossible to >> judge if the function is correct without seeing how it is used and what is >> replacing): > Ok, will do this if v2 is going to be post. >> static void stage2_coalesce_tables_into_block(u64 addr, u32 level, >> kvm_pte_t *ptep, >> struct stage2_map_data *data) >> { >> u64 granule = kvm_granule_size(level), phys = data->phys; >> kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level); >> >> kvm_set_invalid_pte(ptep); >> >> /* >> * Invalidate the whole stage-2, as we may have numerous leaf entries >> * below us which would otherwise need invalidating individually. >> */ >> kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu); >> smp_store_release(ptep, new); >> data->phys += granule; >> } >> >> This works because __kvm_pgtable_visit() saves the *ptep value before calling the >> pre callback, and it visits the next level table based on the initial pte value, >> not the new value written by stage2_coalesce_tables_into_block(). > Right. So before replacing the initial pte value with the new value, we have to use > *data->follow = kvm_pte_follow(*ptep)* in stage2_map_walk_table_pre() to save > the initial pte value in advance. And data->follow will be used when we start to > unmap the old sub-level tables later. Right, stage2_map_walk_table_post() will use data->follow to free the table page which is no longer needed because we've replaced the entire next level table with a block mapping. >> >> Assuming the first patch in the series is merged ("KVM: arm64: Move the clean of >> dcache to the map handler"), this function is missing the CMOs from >> stage2_map_walker_try_leaf(). > Yes, the CMOs are not performed in stage2_coalesce_tables_into_block() currently, > because I thought they were not needed when we rebuild the block mappings from > normal page mappings. This assumes that the *only* situation when we replace a table entry with a block mapping is when the next level table (or tables) is *fully* populated. Is there a way to prove that this is true? I think it's important to prove it unequivocally, because if there's a corner case where this doesn't happen and we remove the dcache maintenance, we can end up with hard to reproduce and hard to diagnose errors in a guest. > > At least, they are not needed if we rebuild the block mappings backed by hugetlbfs > pages, because we must have built the new block mappings for the first time before > and now need to rebuild them after they were split in dirty logging. Can we > agree on this? > Then let's see the following situation. >> I can think of the following situation where they >> are needed: >> >> 1. The 2nd level (PMD) table that will be turned into a block is mapped at stage 2 >> because one of the pages in the 3rd level (PTE) table it points to is accessed by >> the guest. >> >> 2. The kernel decides to turn the userspace mapping into a transparent huge page >> and calls the mmu notifier to remove the mapping from stage 2. The 2nd level table >> is still valid. > I have a question here. Won't the PMD entry been invalidated too in this case? > If remove of the stage2 mapping by mmu notifier is an unmap operation of a range, > then it's correct and reasonable to both invalidate the PMD entry and free the > PTE table. > As I know, kvm_pgtable_stage2_unmap() does so when unmapping a range. > > And if I was right about this, we will not end up in > stage2_coalesce_tables_into_block() > like step 3 describes, but in stage2_map_walker_try_leaf() instead. Because the > PMD entry > is invalid, so KVM will create the new 2M block mapping. Looking at the code for stage2_unmap_walker(), I believe you are correct. After the entire PTE table has been unmapped, the function will mark the PMD entry as invalid. In the situation I described, at step 3 we would end up in the leaf mapper function because the PMD entry is invalid. My example was wrong. > > If I'm wrong about this, then I think this is a valid situation. >> 3. Guest accesses a page which is not the page it accessed at step 1, which causes >> a translation fault. KVM decides we can use a PMD block mapping to map the address >> and we end up in stage2_coalesce_tables_into_block(). We need CMOs in this case >> because the guest accesses memory it didn't access before. >> >> What do you think, is that a valid situation? >>> return 0; >>> } >>> @@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 >>> end, u32 level, >>> kvm_pte_t *ptep, >>> struct stage2_map_data *data) >>> { >>> - int ret = 0; >>> - >>> if (!data->anchor) >>> return 0; >>> - free_page((unsigned long)kvm_pte_follow(*ptep)); >>> - put_page(virt_to_page(ptep)); >>> - >>> - if (data->anchor == ptep) { >>> + if (data->anchor != ptep) { >>> + free_page((unsigned long)kvm_pte_follow(*ptep)); >>> + put_page(virt_to_page(ptep)); >>> + } else { >>> + free_page((unsigned long)data->follow); >>> data->anchor = NULL; >>> - ret = stage2_map_walk_leaf(addr, end, level, ptep, data); >> stage2_map_walk_leaf() -> stage2_map_walk_table_post calls put_page() and >> get_page() once in our case (valid old mapping). It looks to me like we're missing >> a put_page() call when the function is called for the anchor. Have you found the >> call to be unnecessary? > Before this patch: > When we find data->anchor == ptep, put_page() has been called once in advance > for the anchor > in stage2_map_walk_table_post(). Then we call stage2_map_walk_leaf() -> > stage2_map_walker_try_leaf() > to install the block entry, and only get_page() will be called once in > stage2_map_walker_try_leaf(). > There is a put_page() followed by a get_page() for the anchor, and there will > not be a problem about > page_counts. This is how I'm reading the code before your patch: - stage2_map_walk_table_post() returns early if there is no anchor. - stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid. The entry was a table so the leaf visitor is not called in __kvm_pgtable_visit(). - __kvm_pgtable_visit() visits the next level table. - stage2_map_walk_table_post() calls put_page(), calls stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The old entry was invalidated by the pre visitor, so it only calls get_page() (and not put_page() + get_page(). I agree with your conclusion, I didn't realize that because the pre visitor marks the entry as invalid, stage2_map_walker_try_leaf() will not call put_page(). > > After this patch: > Before we find data->anchor == ptep and after it, there is not a put_page() call > for the anchor. > This is because that we didn't call get_page() either in > stage2_coalesce_tables_into_block() when > install the block entry. So I think there will not be a problem too. I agree, the refcount will be identical. > > Is above the right answer for your point? Yes, thank you clearing that up for me. Thanks, Alex >>> } >>> - return ret; >>> + return 0; >> I think it's correct for this function to succeed unconditionally. The error was >> coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The function >> can return an error code if block mapping is not supported, which we know is >> supported because we have an anchor, and if only the permissions are different >> between the old and the new entry, but in our case we've changed both the valid >> and type bits. > Agreed. Besides, we will definitely not end up updating an old valid entry for > the anchor > in stage2_map_walker_try_leaf(), because *anchor has already been invalidated in > stage2_map_walk_table_pre() before set the anchor, so it will look like a build > of new mapping. > > Thanks, > > Yanan >> Thanks, >> >> Alex >> >>> } >>> /* >> .