Hi Laurent, Regression test for v12 patch serials have been run on Intel 2s skylake platform, some regressions were found by LKP-tools (linux kernel performance). Only tested the cases that have been run and found regressions on v11 patch serials. Get the patch serials from https://github.com/ldu4/linux/tree/spf-v12. Kernel commit: base: a297558ad4479e0c9c5c14f3f69fe43113f72d1c (v5.1-rc4-mmotm-2019-04-09-17-51) head: 02c5a1f984a8061d075cfd74986ac8aa01d81064 (spf-v12) Benchmark: will-it-scale Download link: https://github.com/antonblanchard/will-it-scale/tree/master Metrics: will-it-scale.per_thread_ops=threads/nr_cpu test box: lkp-skl-2sp8(nr_cpu=72,memory=192G) THP: enable / disable nr_task: 100% The following is benchmark results, tested 4 times for every case. a). Enable THP base %stddev change head %stddev will-it-scale.page_fault3.per_thread_ops 63216 ±3% -16.9% 52537 ±4% will-it-scale.page_fault2.per_thread_ops 36862 -9.8% 33256 b). Disable THP base %stddev change head %stddev will-it-scale.page_fault3.per_thread_ops 65111 -18.6% 53023 ±2% will-it-scale.page_fault2.per_thread_ops 38164 -12.0% 33565 Best regards, Haiyan Song On Tue, Apr 16, 2019 at 03:44:51PM +0200, Laurent Dufour wrote: > This is a port on kernel 5.1 of the work done by Peter Zijlstra to handle > page fault without holding the mm semaphore [1]. > > The idea is to try to handle user space page faults without holding the > mmap_sem. This should allow better concurrency for massively threaded > process since the page fault handler will not wait for other threads memory > layout change to be done, assuming that this change is done in another part > of the process's memory space. This type of page fault is named speculative > page fault. If the speculative page fault fails because a concurrency has > been detected or because underlying PMD or PTE tables are not yet > allocating, it is failing its processing and a regular page fault is then > tried. > > The speculative page fault (SPF) has to look for the VMA matching the fault > address without holding the mmap_sem, this is done by protecting the MM RB > tree with RCU and by using a reference counter on each VMA. When fetching a > VMA under the RCU protection, the VMA's reference counter is incremented to > ensure that the VMA will not freed in our back during the SPF > processing. Once that processing is done the VMA's reference counter is > decremented. To ensure that a VMA is still present when walking the RB tree > locklessly, the VMA's reference counter is incremented when that VMA is > linked in the RB tree. When the VMA is unlinked from the RB tree, its > reference counter will be decremented at the end of the RCU grace period, > ensuring it will be available during this time. This means that the VMA > freeing could be delayed and could delay the file closing for file > mapping. Since the SPF handler is not able to manage file mapping, file is > closed synchronously and not during the RCU cleaning. This is safe since > the page fault handler is aborting if a file pointer is associated to the > VMA. > > Using RCU fixes the overhead seen by Haiyan Song using the will-it-scale > benchmark [2]. > > The VMA's attributes checked during the speculative page fault processing > have to be protected against parallel changes. This is done by using a per > VMA sequence lock. This sequence lock allows the speculative page fault > handler to fast check for parallel changes in progress and to abort the > speculative page fault in that case. > > Once the VMA has been found, the speculative page fault handler would check > for the VMA's attributes to verify that the page fault has to be handled > correctly or not. Thus, the VMA is protected through a sequence lock which > allows fast detection of concurrent VMA changes. If such a change is > detected, the speculative page fault is aborted and a *classic* page fault > is tried. VMA sequence lockings are added when VMA attributes which are > checked during the page fault are modified. > > When the PTE is fetched, the VMA is checked to see if it has been changed, > so once the page table is locked, the VMA is valid, so any other changes > leading to touching this PTE will need to lock the page table, so no > parallel change is possible at this time. > > The locking of the PTE is done with interrupts disabled, this allows > checking for the PMD to ensure that there is not an ongoing collapsing > operation. Since khugepaged is firstly set the PMD to pmd_none and then is > waiting for the other CPU to have caught the IPI interrupt, if the pmd is > valid at the time the PTE is locked, we have the guarantee that the > collapsing operation will have to wait on the PTE lock to move > forward. This allows the SPF handler to map the PTE safely. If the PMD > value is different from the one recorded at the beginning of the SPF > operation, the classic page fault handler will be called to handle the > operation while holding the mmap_sem. As the PTE lock is done with the > interrupts disabled, the lock is done using spin_trylock() to avoid dead > lock when handling a page fault while a TLB invalidate is requested by > another CPU holding the PTE. > > In pseudo code, this could be seen as: > speculative_page_fault() > { > vma = find_vma_rcu() > check vma sequence count > check vma's support > disable interrupt > check pgd,p4d,...,pte > save pmd and pte in vmf > save vma sequence counter in vmf > enable interrupt > check vma sequence count > handle_pte_fault(vma) > .. > page = alloc_page() > pte_map_lock() > disable interrupt > abort if sequence counter has changed > abort if pmd or pte has changed > pte map and lock > enable interrupt > if abort > free page > abort > ... > put_vma(vma) > } > > arch_fault_handler() > { > if (speculative_page_fault(&vma)) > goto done > again: > lock(mmap_sem) > vma = find_vma(); > handle_pte_fault(vma); > if retry > unlock(mmap_sem) > goto again; > done: > handle fault error > } > > Support for THP is not done because when checking for the PMD, we can be > confused by an in progress collapsing operation done by khugepaged. The > issue is that pmd_none() could be true either if the PMD is not already > populated or if the underlying PTE are in the way to be collapsed. So we > cannot safely allocate a PMD if pmd_none() is true. > > This series add a new software performance event named 'speculative-faults' > or 'spf'. It counts the number of successful page fault event handled > speculatively. When recording 'faults,spf' events, the faults one is > counting the total number of page fault events while 'spf' is only counting > the part of the faults processed speculatively. > > There are some trace events introduced by this series. They allow > identifying why the page faults were not processed speculatively. This > doesn't take in account the faults generated by a monothreaded process > which directly processed while holding the mmap_sem. This trace events are > grouped in a system named 'pagefault', they are: > > - pagefault:spf_vma_changed : if the VMA has been changed in our back > - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set. > - pagefault:spf_vma_notsup : the VMA's type is not supported > - pagefault:spf_vma_access : the VMA's access right are not respected > - pagefault:spf_pmd_changed : the upper PMD pointer has changed in our > back. > > To record all the related events, the easier is to run perf with the > following arguments : > $ perf stat -e 'faults,spf,pagefault:*' <command> > > There is also a dedicated vmstat counter showing the number of successful > page fault handled speculatively. I can be seen this way: > $ grep speculative_pgfault /proc/vmstat > > It is possible to deactivate the speculative page fault handler by echoing > 0 in /proc/sys/vm/speculative_page_fault. > > This series builds on top of v5.1-rc4-mmotm-2019-04-09-17-51 and is > functional on x86, PowerPC. I cross built it on arm64 but I was not able to > test it. > > This series is also available on github [4]. > > --------------------- > Real Workload results > > Test using a "popular in memory multithreaded database product" on 128cores > SMT8 Power system are in progress and I will come back with performance > mesurement as soon as possible. With the previous series we seen up to 30% > improvements in the number of transaction processed per second, and we hope > this will be the case with this series too. > > ------------------ > Benchmarks results > > Base kernel is v5.1-rc4-mmotm-2019-04-09-17-51 > SPF is BASE + this series > > Kernbench: > ---------- > Here are the results on a 48 CPUs X86 system using kernbench on a 5.0 > kernel (kernel is build 5 times): > > Average Half load -j 24 > Run (std deviation) > BASE SPF > Elapsed Time 56.52 (1.39185) 56.256 (1.15106) 0.47% > User Time 980.018 (2.94734) 984.958 (1.98518) -0.50% > System Time 130.744 (1.19148) 133.616 (0.873573) -2.20% > Percent CPU 1965.6 (49.682) 1988.4 (40.035) -1.16% > Context Switches 29926.6 (272.789) 30472.4 (109.569) -1.82% > Sleeps 124793 (415.87) 125003 (591.008) -0.17% > > Average Optimal load -j 48 > Run (std deviation) > BASE SPF > Elapsed Time 46.354 (0.917949) 45.968 (1.42786) 0.83% > User Time 1193.42 (224.96) 1196.78 (223.28) -0.28% > System Time 143.306 (13.2726) 146.177 (13.2659) -2.00% > Percent CPU 2668.6 (743.157) 2699.9 (753.767) -1.17% > Context Switches 62268.3 (34097.1) 62721.7 (33999.1) -0.73% > Sleeps 132556 (8222.99) 132607 (8077.6) -0.04% > > During a run on the SPF, perf events were captured: > Performance counter stats for '../kernbench -M': > 525,873,132 faults > 242 spf > 0 pagefault:spf_vma_changed > 0 pagefault:spf_vma_noanon > 441 pagefault:spf_vma_notsup > 0 pagefault:spf_vma_access > 0 pagefault:spf_pmd_changed > > Very few speculative page faults were recorded as most of the processes > involved are monothreaded (sounds that on this architecture some threads > were created during the kernel build processing). > > Here are the kerbench results on a 1024 CPUs Power8 VM: > > 5.1.0-rc4-mm1+ 5.1.0-rc4-mm1-spf-rcu+ > Average Half load -j 512 Run (std deviation): > Elapsed Time 52.52 (0.906697) 52.778 (0.510069) -0.49% > User Time 3855.43 (76.378) 3890.44 (73.0466) -0.91% > System Time 1977.24 (182.316) 1974.56 (166.097) 0.14% > Percent CPU 11111.6 (540.461) 11115.2 (458.907) -0.03% > Context Switches 83245.6 (3061.44) 83651.8 (1202.31) -0.49% > Sleeps 613459 (23091.8) 628378 (27485.2) -2.43% > > Average Optimal load -j 1024 Run (std deviation): > Elapsed Time 52.964 (0.572346) 53.132 (0.825694) -0.32% > User Time 4058.22 (222.034) 4070.2 (201.646) -0.30% > System Time 2672.81 (759.207) 2712.13 (797.292) -1.47% > Percent CPU 12756.7 (1786.35) 12806.5 (1858.89) -0.39% > Context Switches 88818.5 (6772) 87890.6 (5567.72) 1.04% > Sleeps 618658 (20842.2) 636297 (25044) -2.85% > > During a run on the SPF, perf events were captured: > Performance counter stats for '../kernbench -M': > 149 375 832 faults > 1 spf > 0 pagefault:spf_vma_changed > 0 pagefault:spf_vma_noanon > 561 pagefault:spf_vma_notsup > 0 pagefault:spf_vma_access > 0 pagefault:spf_pmd_changed > > Most of the processes involved are monothreaded so SPF is not activated but > there is no impact on the performance. > > Ebizzy: > ------- > The test is counting the number of records per second it can manage, the > higher is the best. I run it like this 'ebizzy -mTt <nrcpus>'. To get > consistent result I repeated the test 100 times and measure the average > result. The number is the record processes per second, the higher is the best. > > BASE SPF delta > 24 CPUs x86 5492.69 9383.07 70.83% > 1024 CPUS P8 VM 8476.74 17144.38 102% > > Here are the performance counter read during a run on a 48 CPUs x86 node: > Performance counter stats for './ebizzy -mTt 48': > 11,846,569 faults > 10,886,706 spf > 957,702 pagefault:spf_vma_changed > 0 pagefault:spf_vma_noanon > 815 pagefault:spf_vma_notsup > 0 pagefault:spf_vma_access > 0 pagefault:spf_pmd_changed > > And the ones captured during a run on a 1024 CPUs Power VM: > Performance counter stats for './ebizzy -mTt 1024': > 1 359 789 faults > 1 284 910 spf > 72 085 pagefault:spf_vma_changed > 0 pagefault:spf_vma_noanon > 2 669 pagefault:spf_vma_notsup > 0 pagefault:spf_vma_access > 0 pagefault:spf_pmd_changed > > In ebizzy's case most of the page fault were handled in a speculative way, > leading the ebizzy performance boost. > > ------------------ > Changes since v11 [3] > - Check vm_ops.fault instead of vm_ops since now all the VMA as a vm_ops. > - Abort speculative page fault when doing swap readhead because VMA's > boundaries are not protected at this time. Doing this the first swap in > is doing a readhead, the next fault should be handled in a speculative > way as the page is present in the swap read page. > - Handle a race between copy_pte_range() and the wp_page_copy called by > the speculative page fault handler. > - Ported to Kernel v5.0 > - Moved VM_FAULT_PTNOTSAME define in mm_types.h > - Use RCU to protect the MM RB tree instead of a rwlock. > - Add a toggle interface: /proc/sys/vm/speculative_page_fault > > [1] https://lore.kernel.org/linux-mm/20141020215633.717315139@xxxxxxxxxxxxx/ > [2] https://lore.kernel.org/linux-mm/9FE19350E8A7EE45B64D8D63D368C8966B847F54@xxxxxxxxxxxxxxxxxxxxxxxxxxxx/ > [3] https://lore.kernel.org/linux-mm/1526555193-7242-1-git-send-email-ldufour@xxxxxxxxxxxxxxxxxx/ > [4] https://github.com/ldu4/linux/tree/spf-v12 > > Laurent Dufour (25): > mm: introduce CONFIG_SPECULATIVE_PAGE_FAULT > x86/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > powerpc/mm: set ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > mm: introduce pte_spinlock for FAULT_FLAG_SPECULATIVE > mm: make pte_unmap_same compatible with SPF > mm: introduce INIT_VMA() > mm: protect VMA modifications using VMA sequence count > mm: protect mremap() against SPF hanlder > mm: protect SPF handler against anon_vma changes > mm: cache some VMA fields in the vm_fault structure > mm/migrate: Pass vm_fault pointer to migrate_misplaced_page() > mm: introduce __lru_cache_add_active_or_unevictable > mm: introduce __vm_normal_page() > mm: introduce __page_add_new_anon_rmap() > mm: protect against PTE changes done by dup_mmap() > mm: protect the RB tree with a sequence lock > mm: introduce vma reference counter > mm: Introduce find_vma_rcu() > mm: don't do swap readahead during speculative page fault > mm: adding speculative page fault failure trace events > perf: add a speculative page fault sw event > perf tools: add support for the SPF perf event > mm: add speculative page fault vmstats > powerpc/mm: add speculative page fault > mm: Add a speculative page fault switch in sysctl > > Mahendran Ganesh (2): > arm64/mm: define ARCH_SUPPORTS_SPECULATIVE_PAGE_FAULT > arm64/mm: add speculative page fault > > Peter Zijlstra (4): > mm: prepare for FAULT_FLAG_SPECULATIVE > mm: VMA sequence count > mm: provide speculative fault infrastructure > x86/mm: add speculative pagefault handling > > arch/arm64/Kconfig | 1 + > arch/arm64/mm/fault.c | 12 + > arch/powerpc/Kconfig | 1 + > arch/powerpc/mm/fault.c | 16 + > arch/x86/Kconfig | 1 + > arch/x86/mm/fault.c | 14 + > fs/exec.c | 1 + > fs/proc/task_mmu.c | 5 +- > fs/userfaultfd.c | 17 +- > include/linux/hugetlb_inline.h | 2 +- > include/linux/migrate.h | 4 +- > include/linux/mm.h | 138 +++++- > include/linux/mm_types.h | 16 +- > include/linux/pagemap.h | 4 +- > include/linux/rmap.h | 12 +- > include/linux/swap.h | 10 +- > include/linux/vm_event_item.h | 3 + > include/trace/events/pagefault.h | 80 ++++ > include/uapi/linux/perf_event.h | 1 + > kernel/fork.c | 35 +- > kernel/sysctl.c | 9 + > mm/Kconfig | 22 + > mm/huge_memory.c | 6 +- > mm/hugetlb.c | 2 + > mm/init-mm.c | 3 + > mm/internal.h | 45 ++ > mm/khugepaged.c | 5 + > mm/madvise.c | 6 +- > mm/memory.c | 631 ++++++++++++++++++++++---- > mm/mempolicy.c | 51 ++- > mm/migrate.c | 6 +- > mm/mlock.c | 13 +- > mm/mmap.c | 249 ++++++++-- > mm/mprotect.c | 4 +- > mm/mremap.c | 13 + > mm/nommu.c | 1 + > mm/rmap.c | 5 +- > mm/swap.c | 6 +- > mm/swap_state.c | 10 +- > mm/vmstat.c | 5 +- > tools/include/uapi/linux/perf_event.h | 1 + > tools/perf/util/evsel.c | 1 + > tools/perf/util/parse-events.c | 4 + > tools/perf/util/parse-events.l | 1 + > tools/perf/util/python.c | 1 + > 45 files changed, 1277 insertions(+), 196 deletions(-) > create mode 100644 include/trace/events/pagefault.h > > -- > 2.21.0 >