On 2021/3/4 15:07, wangyanan (Y) wrote:
Hi Alex,
On 2021/3/4 1:27, Alexandru Elisei wrote:
Hi Yanan,
On 3/3/21 11:04 AM, wangyanan (Y) wrote:
Hi Alex,
On 2021/3/3 1:13, Alexandru Elisei wrote:
Hello,
On 2/8/21 11:22 AM, Yanan Wang wrote:
When KVM needs to coalesce normal page mappings into a block mapping, we
currently invalidate the old table entry first, then invalidate the TLB, then
unmap the page mappings, and install the block entry last. Unmapping the
numerous page mappings takes a long time, which means there is a long period
during which the table entry can be found invalid. If other vCPUs access any
guest page within the block range and find the table entry invalid, they will
all exit from the guest with a translation fault which is not necessary, and
KVM will have to handle these faults, especially when performing CMOs by block
range.

So let's install the block entry first to ensure uninterrupted memory access
by the other vCPUs, and then unmap the page mappings after installation. This
removes most of the window during which the table entry is invalid, and avoids
most of the unnecessary translation faults.
I'm not convinced I've fully understood what is going on yet, but it seems to
me that the idea is sound. Some questions and comments below.
What I am trying to do in this patch is to adjust the order of rebuilding
block mappings from page mappings.

Take the rebuilding of a 1G block mapping as an example.
Before this patch, the order is:
1) invalidate the table entry at the 1st level (PUD)
2) flush the TLB by VMID
3) unmap the old PMD/PTE tables
4) install the new block entry at the 1st level (PUD)

So the entry at the 1st level can be found invalid by other vCPUs in 1), 2),
and 3), and 3) takes a long time to unmap the numerous old PMD/PTE tables,
which means the total time during which the entry is invalid is long enough to
affect performance.
After this patch, the order is:
1) invalidate the table entry at the 1st level (PUD)
2) flush the TLB by VMID
3) install the new block entry at the 1st level (PUD)
4) unmap the old PMD/PTE tables

The change ensures that the period during which the entry at the 1st level
(PUD) is invalid covers only 1) and 2), so if other vCPUs access memory within
the 1G range, there is much less chance that they find the entry invalid and,
as a result, trigger an unnecessary translation fault.
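To make the contrast concrete, here is a condensed sketch of the two orderings
(unmap_old_tables() and block_entry are placeholders for illustration, not
real helpers):

	/* Old order: the PUD entry stays invalid across 1)-3). */
	kvm_set_invalid_pte(ptep);			/* 1) */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);	/* 2) */
	unmap_old_tables();	/* 3) slow: tears down many PMD/PTE tables */
	smp_store_release(ptep, block_entry);		/* 4) */

	/* New order: the PUD entry is invalid only across 1)-2). */
	kvm_set_invalid_pte(ptep);			/* 1) */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);	/* 2) */
	smp_store_release(ptep, block_entry);		/* 3) */
	unmap_old_tables();				/* 4) */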
Thank you for the explanation, that was my understanding of it also, and I
believe your idea is correct. I was more concerned that I got some of the
details wrong, and you have kindly corrected me below.
Signed-off-by: Yanan Wang <wangyanan55@xxxxxxxxxx>
---
arch/arm64/kvm/hyp/pgtable.c | 26 ++++++++++++--------------
1 file changed, 12 insertions(+), 14 deletions(-)
diff --git a/arch/arm64/kvm/hyp/pgtable.c b/arch/arm64/kvm/hyp/pgtable.c
index 78a560446f80..308c36b9cd21 100644
--- a/arch/arm64/kvm/hyp/pgtable.c
+++ b/arch/arm64/kvm/hyp/pgtable.c
@@ -434,6 +434,7 @@ struct stage2_map_data {
 	kvm_pte_t			attr;
 
 	kvm_pte_t			*anchor;
+	kvm_pte_t			*follow;
 
 	struct kvm_s2_mmu		*mmu;
 	struct kvm_mmu_memory_cache	*memcache;
@@ -553,15 +554,14 @@ static int stage2_map_walk_table_pre(u64 addr, u64 end, u32 level,
 	if (!kvm_block_mapping_supported(addr, end, data->phys, level))
 		return 0;
 
-	kvm_set_invalid_pte(ptep);
-
 	/*
-	 * Invalidate the whole stage-2, as we may have numerous leaf
-	 * entries below us which would otherwise need invalidating
-	 * individually.
+	 * If we need to coalesce existing table entries into a block here,
+	 * then install the block entry first and the sub-level page mappings
+	 * will be unmapped later.
 	 */
-	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
 	data->anchor = ptep;
+	data->follow = kvm_pte_follow(*ptep);
+	stage2_coalesce_tables_into_block(addr, level, ptep, data);
Here's how stage2_coalesce_tables_into_block() is implemented in the previous
patch (it might be worth merging it with this patch, I found it impossible to
judge whether the function is correct without seeing how it is used and what
it is replacing):
Ok, I will do this if a v2 is going to be posted.
static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
					      kvm_pte_t *ptep,
					      struct stage2_map_data *data)
{
	u64 granule = kvm_granule_size(level), phys = data->phys;
	kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

	kvm_set_invalid_pte(ptep);

	/*
	 * Invalidate the whole stage-2, as we may have numerous leaf entries
	 * below us which would otherwise need invalidating individually.
	 */
	kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);
	smp_store_release(ptep, new);
	data->phys += granule;
}
This works because __kvm_pgtable_visit() saves the *ptep value before calling
the pre callback, and it visits the next level table based on the initial pte
value, not the new value written by stage2_coalesce_tables_into_block().
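Roughly, the relevant part of __kvm_pgtable_visit() looks like this (a
simplified sketch from my reading of pgtable.c, not the verbatim code):

	static int __kvm_pgtable_visit(struct kvm_pgtable_walk_data *data,
				       kvm_pte_t *ptep, u32 level)
	{
		kvm_pte_t *childp, pte = *ptep;	/* the pte is read once, here */
		bool table = kvm_pte_table(pte, level);

		if (table && (data->walker->flags & KVM_PGTABLE_WALK_TABLE_PRE))
			kvm_pgtable_visitor_cb(data, data->addr, level, ptep,
					       KVM_PGTABLE_WALK_TABLE_PRE);

		/* ... leaf visitor and error handling elided ... */

		/* The next level is walked via the saved pte, not the updated *ptep. */
		childp = kvm_pte_follow(pte);
		return __kvm_pgtable_walk(data, childp, level + 1);
	}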
Right. So before replacing the initial pte value with the new value, we have
to use data->follow = kvm_pte_follow(*ptep) in stage2_map_walk_table_pre() to
save the initial pte value in advance. And data->follow will be used when we
start to unmap the old sub-level tables later.
Right, stage2_map_walk_table_post() will use data->follow to free the table
page which is no longer needed because we've replaced the entire next level
table with a block mapping.
Assuming the first patch in the series is merged ("KVM: arm64: Move the clean
of dcache to the map handler"), this function is missing the CMOs from
stage2_map_walker_try_leaf().
Yes, the CMOs are not performed in stage2_coalesce_tables_into_block()
currently, because I thought they were not needed when we rebuild the block
mappings from normal page mappings.
This assumes that the *only* situation when we replace a table entry with a
block mapping is when the next level table (or tables) is *fully* populated.
Is there a way to prove that this is true? I think it's important to prove it
unequivocally, because if there's a corner case where this doesn't happen and
we remove the dcache maintenance, we can end up with hard to reproduce and
hard to diagnose errors in a guest.
So there is still one thing left to determine about this patch: whether we can
simply omit the CMOs in stage2_coalesce_tables_into_block(), or whether we
need to distinguish between different situations.

Now that we know the situation you described won't happen, I think we will
only end up in stage2_coalesce_tables_into_block() in the following situation:
1) KVM creates a new block mapping in stage2_map_walker_try_leaf() for the
first time, when the guest accesses memory backed by a THP/HUGETLB huge page.
CMOs are performed here.
2) KVM splits this block mapping when dirty logging is enabled, and builds
only one new page mapping.
3) KVM builds the other new page mappings lazily during dirty logging, as the
guest accesses other pages within the block. *At this stage, the pages in the
block may or may not be fully mapped.*
4) After dirty logging is disabled, KVM decides to rebuild the block mapping.

Do we still have to perform CMOs when rebuilding the block mapping in step 4,
if the pages in the block were not fully mapped in step 3? I'm not completely
sure about this.
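If it turns out that CMOs are still needed for the partially mapped case, one
option would be to clean the range before installing the block entry. Just a
sketch, assuming the dcache helpers from the first patch of this series (I use
stage2_pte_cacheable()/stage2_flush_dcache() here; the exact names may differ):

	static void stage2_coalesce_tables_into_block(u64 addr, u32 level,
						      kvm_pte_t *ptep,
						      struct stage2_map_data *data)
	{
		u64 granule = kvm_granule_size(level), phys = data->phys;
		kvm_pte_t new = kvm_init_valid_leaf_pte(phys, data->attr, level);

		kvm_set_invalid_pte(ptep);
		kvm_call_hyp(__kvm_tlb_flush_vmid, data->mmu);

		/* Clean the whole block range before it becomes reachable. */
		if (stage2_pte_cacheable(new))
			stage2_flush_dcache(kvm_pte_follow(new), granule);

		smp_store_release(ptep, new);
		data->phys += granule;
	}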
Hi Marc,
Could you please share your thoughts on the above question :) ?
Thanks,
Yanan
Thanks,
Yanan
At least, they are not needed when we rebuild block mappings backed by
hugetlbfs pages, because we must have already built the block mappings for the
first time before, and now we only need to rebuild them after they were split
for dirty logging. Can we agree on this?

Then let's look at the following situation.
I can think of the following situation where they are needed:

1. The 2nd level (PMD) table that will be turned into a block is mapped at
stage 2 because one of the pages in the 3rd level (PTE) table it points to is
accessed by the guest.

2. The kernel decides to turn the userspace mapping into a transparent huge
page and calls the mmu notifier to remove the mapping from stage 2. The 2nd
level table is still valid.
I have a question here: won't the PMD entry be invalidated too in this case?
If the removal of the stage-2 mapping by the mmu notifier is an unmap
operation on a range, then it is correct and reasonable to both invalidate the
PMD entry and free the PTE table. As far as I know, kvm_pgtable_stage2_unmap()
does exactly that when unmapping a range.

And if I'm right about this, we will not end up in
stage2_coalesce_tables_into_block() as step 3 describes, but in
stage2_map_walker_try_leaf() instead, because the PMD entry is invalid and KVM
will create the new 2M block mapping there.
Looking at the code for stage2_unmap_walker(), I believe you are correct.
After the entire PTE table has been unmapped, the function will mark the PMD
entry as invalid. In the situation I described, at step 3 we would end up in
the leaf mapper function because the PMD entry is invalid. My example was
wrong.
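For reference, this is roughly what the unmap walker does once all the entries
of a child table have been removed (a simplified sketch from my reading of
stage2_unmap_walker(), not the verbatim code):

	kvm_pte_t pte = *ptep, *childp = NULL;

	if (kvm_pte_table(pte, level)) {
		childp = kvm_pte_follow(pte);
		/* Bail out if the child table still has live entries. */
		if (page_count(virt_to_page(childp)) != 1)
			return 0;
	}

	kvm_set_invalid_pte(ptep);	/* the PMD/PUD entry becomes invalid */
	kvm_call_hyp(__kvm_tlb_flush_vmid_ipa, mmu, addr, level);
	put_page(virt_to_page(ptep));

	if (childp)
		free_page((unsigned long)childp);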
If I'm wrong about this, then I think this is a valid situation.
3. The guest accesses a page which is not the page it accessed at step 1,
which causes a translation fault. KVM decides it can use a PMD block mapping
to map the address, and we end up in stage2_coalesce_tables_into_block(). We
need CMOs in this case because the guest accesses memory it didn't access
before.

What do you think, is that a valid situation?
 	return 0;
 }
@@ -614,20 +614,18 @@ static int stage2_map_walk_table_post(u64 addr, u64 end, u32 level,
 				      kvm_pte_t *ptep,
 				      struct stage2_map_data *data)
 {
-	int ret = 0;
-
 	if (!data->anchor)
 		return 0;
 
-	free_page((unsigned long)kvm_pte_follow(*ptep));
-	put_page(virt_to_page(ptep));
-
-	if (data->anchor == ptep) {
+	if (data->anchor != ptep) {
+		free_page((unsigned long)kvm_pte_follow(*ptep));
+		put_page(virt_to_page(ptep));
+	} else {
+		free_page((unsigned long)data->follow);
 		data->anchor = NULL;
-		ret = stage2_map_walk_leaf(addr, end, level, ptep, data);
stage2_map_walk_leaf() -> stage2_map_walker_try_leaf() calls put_page() and
get_page() once in our case (valid old mapping). It looks to me like we're
missing a put_page() call when the function is called for the anchor. Have you
found the call to be unnecessary?
Before this patch: when we find data->anchor == ptep, put_page() has already
been called once for the anchor in stage2_map_walk_table_post(). Then we call
stage2_map_walk_leaf() -> stage2_map_walker_try_leaf() to install the block
entry, and only get_page() will be called once in stage2_map_walker_try_leaf().
So there is a put_page() followed by a get_page() for the anchor, and there
will be no problem with the page counts.
This is how I'm reading the code before your patch:

- stage2_map_walk_table_post() returns early if there is no anchor.
- stage2_map_walk_table_pre() sets the anchor and marks the entry as invalid.
The entry was a table, so the leaf visitor is not called in
__kvm_pgtable_visit().
- __kvm_pgtable_visit() visits the next level table.
- stage2_map_walk_table_post() calls put_page(), then calls
stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). The old entry was
invalidated by the pre visitor, so it only calls get_page() (and not
put_page() + get_page()).

I agree with your conclusion, I didn't realize that because the pre visitor
marks the entry as invalid, stage2_map_walker_try_leaf() will not call
put_page().
After this patch: there is no put_page() call for the anchor, either before or
after we find data->anchor == ptep. This is because we don't call get_page()
in stage2_coalesce_tables_into_block() either when installing the block entry,
so I think there will not be a problem.
I agree, the refcount will be identical.
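To recap the bookkeeping on the table page that holds the anchor entry, as
discussed above:

	/*
	 * Before the patch:
	 *   stage2_map_walk_table_post():  put_page() for the anchor entry
	 *   stage2_map_walker_try_leaf():  get_page() only (the entry was
	 *                                  already invalidated by the pre
	 *                                  visitor, so no put_page() there)
	 *   => net refcount change: 0
	 *
	 * After the patch:
	 *   stage2_coalesce_tables_into_block():  no get_page()
	 *   stage2_map_walk_table_post():         no put_page() for the anchor
	 *   => net refcount change: 0, identical to before
	 */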
Does the above answer your point?
Yes, thank you for clearing that up for me.
Thanks,
Alex
 	}
 
-	return ret;
+	return 0;
I think it's correct for this function to succeed unconditionally. The error
was coming from stage2_map_walk_leaf() -> stage2_map_walker_try_leaf(). That
function can return an error code if the block mapping is not supported (which
we know it is, because we have an anchor), or if only the permissions differ
between the old and the new entry, but in our case we've changed both the
valid and type bits.
Agreed. Besides, we will definitely not end up updating an old valid entry for
the anchor in stage2_map_walker_try_leaf(), because *anchor has already been
invalidated in stage2_map_walk_table_pre() before setting the anchor, so it
will look like the creation of a new mapping.
Thanks,
Yanan
Thanks,
Alex
 }
 
 /*