On 2024/11/8 15:38, Qi Zheng wrote:
Hi Jann,
On 2024/11/8 06:39, Jann Horn wrote:
+x86 MM maintainers - x86@xxxxxxxxxx was already cc'ed, but I don't
know if that is enough for them to see it, and I haven't seen them
comment on this series yet; I think you need an ack from them for this
change.
Yes, thanks to Jann for cc-ing the x86 MM maintainers; I look forward to
their feedback!
On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx> wrote:
Now, if CONFIG_MMU_GATHER_RCU_TABLE_FREE is selected, the page table pages
will be freed by semi RCU, that is:
- batch table freeing: asynchronous free by RCU
- single table freeing: IPI + synchronous free
In this way, the page table can be traversed locklessly by disabling IRQs in
paths such as fast GUP. But this is not enough to free the empty PTE page
table pages in paths other than the munmap and exit_mmap paths, because IPI
cannot be synchronized with rcu_read_lock() in pte_offset_map{_lock}().
In preparation for supporting reclamation of empty PTE page table pages,
let single tables also be freed by RCU like batch table freeing. Then we
can also use pte_offset_map() etc. to prevent PTE pages from being freed.
I applied your series locally and followed the page table freeing path
that the reclaim feature would use on x86-64. Looks like it goes like
this with the series applied:
Yes.
free_pte
  pte_free_tlb
    __pte_free_tlb
      ___pte_free_tlb
        paravirt_tlb_remove_table
          tlb_remove_table [!CONFIG_PARAVIRT, Xen PV, Hyper-V, KVM]
            [no-free-memory slowpath:]
              tlb_table_invalidate
              tlb_remove_table_one
                tlb_remove_table_sync_one [does IPI for GUP-fast]
                ^
                It seems that this step can be omitted when
                CONFIG_PT_RECLAIM is enabled, because GUP-fast will
                disable IRQs, which can also serve as the RCU critical
                section.
                __tlb_remove_table_one [frees via RCU]
            [fastpath:]
              tlb_table_flush
                tlb_remove_table_free [frees via RCU]
          native_tlb_remove_table [CONFIG_PARAVIRT on native]
            tlb_remove_table [see above]
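To make the two branches under tlb_remove_table easier to follow, here is a toy userspace model in Python. It is only a sketch of the call tree above, not kernel code: the Gather class, the can_alloc_batch flag, the trace list and the tiny MAX_TABLE_BATCH value are all made up for illustration, and the model assumes a flush happens when the batch fills up.

```python
# Toy userspace model of the call tree above; NOT kernel code.
# MAX_TABLE_BATCH is an arbitrary small value chosen for illustration.
MAX_TABLE_BATCH = 3


class Gather:
    """Hypothetical stand-in for struct mmu_gather; tracks only the batch."""

    def __init__(self, can_alloc_batch=True):
        self.can_alloc_batch = can_alloc_batch  # models batch-page allocation
        self.batch = None
        self.trace = []  # names of the steps taken, in order


def tlb_remove_table(tlb, table):
    if tlb.batch is None:
        if not tlb.can_alloc_batch:
            # [no-free-memory slowpath:] free this one table on its own
            tlb.trace.append("tlb_table_invalidate")
            tlb_remove_table_one(tlb, table)
            return
        tlb.batch = []
    tlb.batch.append(table)
    if len(tlb.batch) == MAX_TABLE_BATCH:
        # [fastpath:] the batch is full (assumed trigger), flush it in one go
        tlb_table_flush(tlb)


def tlb_remove_table_one(tlb, table):
    tlb.trace.append("tlb_remove_table_sync_one")  # IPI for GUP-fast
    tlb.trace.append("__tlb_remove_table_one")     # frees via RCU


def tlb_table_flush(tlb):
    tlb.trace.append("tlb_remove_table_free")      # frees via RCU
    tlb.batch = None
```

With can_alloc_batch=False a single call records the slowpath steps, including the IPI step discussed above; with the default, MAX_TABLE_BATCH calls record one tlb_remove_table_free.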
Basically, the only remaining case in which
paravirt_tlb_remove_table() does not use tlb_remove_table() with RCU
delay is !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM. Given that
CONFIG_PT_RECLAIM is defined as "default y" when supported, I guess
that means X86's direct page table freeing path will almost never be
used? If it stays that way and the X86 folks don't see a performance
impact from using RCU to free tables on munmap() / process exit, I
guess we might want to get rid of the direct page table freeing path
on x86 at some point to simplify things...
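The config statement above can be written out as a small truth table. This Python sketch encodes only that sentence, not the actual Kconfig or header logic, and the function name is hypothetical:

```python
from itertools import product


def uses_rcu_delayed_free(config_paravirt, config_pt_reclaim):
    # Encodes only the claim above: the direct freeing path remains
    # solely for !CONFIG_PARAVIRT && !CONFIG_PT_RECLAIM.
    return config_paravirt or config_pt_reclaim


for paravirt, pt_reclaim in product((False, True), repeat=2):
    if uses_rcu_delayed_free(paravirt, pt_reclaim):
        path = "tlb_remove_table (RCU delay)"
    else:
        path = "direct page table freeing"
    print("CONFIG_PARAVIRT=%s CONFIG_PT_RECLAIM=%s -> %s"
          % ("y" if paravirt else "n", "y" if pt_reclaim else "n", path))
```

Only one of the four combinations ends up on the direct path.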
In theory, using RCU to asynchronously free PTE pages should make the
munmap() / process exit path faster. I can try to grab some data.
I ran 'stress-ng --mmap 1 --mmap-bytes 1G', and grabbed the data with
bpftrace like this:
bpftrace -e '
tracepoint:syscalls:sys_enter_munmap /comm == "stress-ng"/ {
    @start[tid] = nsecs;
}
tracepoint:syscalls:sys_exit_munmap /@start[tid]/ {
    @ns[comm] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}
interval:s:1 { exit(); }'
The results are as follows:
without patch:
@ns[stress-ng]:
[1K, 2K)      99566 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)      77756 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@            |
[4K, 8K)      32545 |@@@@@@@@@@@@@@@@                                    |
[8K, 16K)       442 |                                                    |
[16K, 32K)       69 |                                                    |
[32K, 64K)        1 |                                                    |
[64K, 128K)       1 |                                                    |
[128K, 256K)     14 |                                                    |
[256K, 512K)     14 |                                                    |
[512K, 1M)       68 |                                                    |
with patch:
@ns[stress-ng]:
[512, 1K)        69 |                                                    |
[1K, 2K)      53921 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2K, 4K)      47088 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@       |
[4K, 8K)      20583 |@@@@@@@@@@@@@@@@@@@                                 |
[8K, 16K)       659 |                                                    |
[16K, 32K)       93 |                                                    |
[32K, 64K)       24 |                                                    |
[64K, 128K)      14 |                                                    |
[128K, 256K)      6 |                                                    |
[256K, 512K)     10 |                                                    |
[512K, 1M)       29 |                                                    |
It doesn't seem to have much effect on munmap.
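One way to put a number on this is to compare which share of munmap calls falls below a given latency in each run. The Python sketch below uses the bucket counts from the two histograms above; the share_below helper and the 8K ns cut-off are my own choices for illustration, not part of the measurement.

```python
# Bucket counts copied from the two bpftrace histograms above.
# Keys are the lower bound of each power-of-two bucket, in nanoseconds.
without_patch = {
    1024: 99566, 2048: 77756, 4096: 32545, 8192: 442, 16384: 69,
    32768: 1, 65536: 1, 131072: 14, 262144: 14, 524288: 68,
}
with_patch = {
    512: 69, 1024: 53921, 2048: 47088, 4096: 20583, 8192: 659,
    16384: 93, 32768: 24, 65536: 14, 131072: 6, 262144: 10, 524288: 29,
}


def share_below(hist, limit_ns):
    """Fraction of samples in buckets whose upper bound is <= limit_ns."""
    total = sum(hist.values())
    # Each bucket is [lo, 2 * lo), so its upper bound is 2 * lo.
    below = sum(count for lo, count in hist.items() if 2 * lo <= limit_ns)
    return below / total


for name, hist in (("without patch", without_patch),
                   ("with patch", with_patch)):
    print("%s: %d samples, %.2f%% below 8K ns"
          % (name, sum(hist.values()), 100 * share_below(hist, 8192)))
```

Comparing shares rather than raw counts matters here because the two runs did not collect the same number of samples.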
Thanks,
Qi