On Sat, Dec 7, 2024 at 1:03 AM Xu Lu <luxu.kernel@xxxxxxxxxxxxx> wrote:
>
> Hi Pedro,
>
> On Sat, Dec 7, 2024 at 2:49 AM Pedro Falcato <pedro.falcato@xxxxxxxxx> wrote:
> >
> > On Fri, Dec 6, 2024 at 1:42 PM Xu Lu <luxu.kernel@xxxxxxxxxxxxx> wrote:
> > >
> > > Hi David,
> > >
> > > On Fri, Dec 6, 2024 at 6:13 PM David Hildenbrand <david@xxxxxxxxxx> wrote:
> > > >
> > > > On 06.12.24 03:00, Zi Yan wrote:
> > > > > On 5 Dec 2024, at 5:37, Xu Lu wrote:
> > > > >
> > > > >> This patch series attempts to break through the limitation of the MMU
> > > > >> and support a larger base page on RISC-V, which only supports 4K page
> > > > >> size now. The key idea is to always manage and allocate memory at a
> > > > >> granularity of 64K and use SVNAPOT to accelerate address translation.
> > > > >> This is the second version and the detailed introduction can be found
> > > > >> in [1].
> > > > >>
> > > > >> Changes from v1:
> > > > >> - Rebase on v6.12.
> > > > >>
> > > > >> - Adjust the page table entry shift to reduce page table memory usage.
> > > > >> For example, in SV39, the traditional va behaves as:
> > > > >>
> > > > >> ----------------------------------------------
> > > > >> | pgd index | pmd index | pte index | offset |
> > > > >> ----------------------------------------------
> > > > >> | 38     30 | 29     21 | 20     12 | 11   0 |
> > > > >> ----------------------------------------------
> > > > >>
> > > > >> When we choose 64K as the basic software page, the va now behaves as:
> > > > >>
> > > > >> ----------------------------------------------
> > > > >> | pgd index | pmd index | pte index | offset |
> > > > >> ----------------------------------------------
> > > > >> | 38     34 | 33     25 | 24     16 | 15   0 |
> > > > >> ----------------------------------------------
> > > > >>
> > > > >> - Fix some bugs in v1.
> > > > >>
> > > > >> Thanks in advance for comments.
> > > > >>
> > > > >> [1] https://lwn.net/Articles/952722/
> > > > >
> > > > > This looks very interesting. Can you cc me and linux-mm@xxxxxxxxx
> > > > > in the future? Thanks.
> > > > >
> > > > > Have you thought about doing it for ARM64 4KB as well? ARM64’s contig
> > > > > PTE should have a similar effect to RISC-V’s SVNAPOT, right?
> > > >
> > > > What is the real benefit over 4k + large folios/mTHP?
> > > >
> > > > 64K comes with the problem of internal fragmentation: for example, a
> > > > page table that only occupies 4k of memory suddenly consumes 64K; quite
> > > > a downside.
> > > The original idea comes from the performance benefits we achieved on
> > > the ARM 64K kernel. We ran several real-world applications on the ARM
> > > Ampere Altra platform and found that these apps' performance on the
> > > 64K page kernel is significantly higher than on the 4K page kernel:
> > > For Redis, the throughput has increased by 250% and latency has
> > > decreased by 70%.
> > > For MySQL, the throughput has increased by 16.9% and latency has
> > > decreased by 14.5%.
> > > For our own NewSQL database, throughput has increased by 16.5% and
> > > latency has decreased by 13.8%.
> > >
> > > Also, we have compared the performance between 64K and 4K + large
> > > folios/mTHP on ARM Neoverse-N2. The results show a considerable
> > > performance improvement on the 64K kernel for both speccpu and lmbench,
> > > even when the 4K kernel enables THP and ARM64_CONTPTE:
> > > For the speccpu benchmark, the 64K kernel without any huge page
> > > optimization can still achieve a 4.17% higher score than the 4K kernel
> > > with transparent huge pages as well as the CONTPTE optimization.
> > > For lmbench, the 64K kernel achieves 75.98% lower memory mapping
> > > latency (16MB) than the 4K kernel with transparent huge pages and
> > > the CONTPTE optimization, 84.34% higher mmap read open2close
> > > bandwidth (16MB), and 10.71% lower random load latency (16MB).
> > > Interestingly, sometimes kernels with transparent huge page support
> > > have poorer performance for both 4K and 64K (for example, the mmap
> > > read bandwidth bench). We assume this is due to the overhead of huge
> > > page combination and collapse.
> > > Also, if you check the full results, you will find that usually the
> > > larger the memory size used for testing, the better the performance
> > > of the 64K kernel (compared to the 4K kernel), unless the memory size
> > > lies in a range where the 4K kernel can apply 2MB huge pages while
> > > the 64K kernel can't.
> > > In summary, for performance-sensitive applications which require
> > > higher bandwidth and lower latency, sometimes 4K pages with huge
> > > pages may not be the best choice and a 64K page can achieve better
> > > results.
> > > The test environment and results are attached.
> > >
> > > As RISC-V has no native 64K MMU support, we introduce a software
> > > implementation and accelerate it via Svnapot. Of course, there will
> > > be some extra overhead compared with a native 64K MMU. Thus, we are
> > > also trying to persuade the RISC-V community to support the extension
> > > of a native 64K MMU [1]. Please join us if you are interested.
> > >
> >
> > Ok, so you... didn't test this on riscv? And you're basing this
> > patchset off of a native 64KiB page size kernel being faster than 4KiB
> > + CONTPTE? I don't see how that makes sense?
>
> Sorry for misleading you. I didn't intend to use the ARM data to support
> this patch, just to explain where the idea came from. We do prefer a 64K
> MMU for the performance improvement it brings to real applications and
> benchmarks.

This breaks the ABI, doesn't it? Not only does userspace need to be
recompiled with 64KB alignment, it also must not assume a 4KB base page
size.

> And since RISC-V does not support it yet, we internally
> use this patch as a transitional solution for RISC-V.

Distros need to support this as well. Otherwise it's a tech island.

Also, why RV? It can be a generic feature which applies to other archs
like x86, right? See "page clustering" [1][2].

[1] https://lwn.net/Articles/23785/
[2] https://lore.kernel.org/linux-mm/Pine.LNX.4.21.0107051737340.1577-100000@localhost.localdomain/

> And if a native
> 64K MMU is available, this patch can be canceled.

Why 64KB? Why not 32KB or 128KB? In general, the less dependency on h/w,
the better. Ideally, *if* we want to consider this, it should be a s/w
feature applicable to all (or most) archs.

> The only usage of
> this patch I can think of then is to make the kernel support more page
> sizes than the MMU does, as long as Svnapot supports the corresponding
> size.
>
> We will try to release the performance data in the next version. There
> are still some issues with application and OS adaptation. :) So this
> version is still an RFC.
>
> >
> > /me is confused
> > How many of these PAGE_SIZE wins are related to e.g. userspace basing
> > its buffer sizes (or whatever) off of the system page size? Where
> > exactly are you gaining time versus the CONTPTE stuff?
> >
> > I think MM in general would be better off if we were more transparent
> > with regard to CONTPTE and page sizes instead of hand waving with
> > "hardware page size != software page size", which is such a *checks
> > notes* 4.4BSD idea... :) At the very least, this patchset seems to go
> > against all the work on better supporting large folios and CONTPTE.
>
> By the way, the core modification of this patch is turning the pte
> structure into an array of 16 entries to map a 64K page and accelerating
> it via Svnapot. I think it is all about the architectural pte and has
> little impact on pages or folios. Please remind me if anything is missed
> and I will try to fix it.
> >
> > --
> > Pedro
>
> Thanks,
>
> Xu Lu
>
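
As a side note for readers, the SV39 virtual-address split quoted in the
cover letter above can be illustrated with a minimal standalone sketch.
This is not code from the patch set; the macro and function names below
are invented for the example and only mirror the bit layout shown in the
two tables (pgd 38..34, pmd 33..25, pte 24..16, offset 15..0 for the 64K
case).

/*
 * Illustrative sketch only (not from the patch set): how the SV39
 * virtual-address fields shift when a 64K software page is built on
 * top of 4K hardware pages, per the tables in the cover letter above.
 */
#include <stdio.h>

#define HW_PAGE_SHIFT   12                      /* 4K hardware page     */
#define SW_PAGE_SHIFT   16                      /* 64K software page    */
#define SW_PAGE_SIZE    (1UL << SW_PAGE_SHIFT)

/* 64K layout: offset 15..0, pte index 24..16, pmd 33..25, pgd 38..34 */
#define PTE_SHIFT       SW_PAGE_SHIFT           /* was 12 with 4K pages */
#define PMD_SHIFT       (PTE_SHIFT + 9)         /* was 21               */
#define PGD_SHIFT       (PMD_SHIFT + 9)         /* was 30               */

static unsigned long va_field(unsigned long va, int shift, int bits)
{
	return (va >> shift) & ((1UL << bits) - 1);
}

int main(void)
{
	unsigned long va = 0x123456789aUL & ((1UL << 39) - 1);

	printf("pgd=%lu pmd=%lu pte=%lu offset=%lu\n",
	       va_field(va, PGD_SHIFT, 5),        /* pgd index: 5 bits  */
	       va_field(va, PMD_SHIFT, 9),        /* pmd index: 9 bits  */
	       va_field(va, PTE_SHIFT, 9),        /* pte index: 9 bits  */
	       va & (SW_PAGE_SIZE - 1));          /* 16-bit page offset */
	return 0;
}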
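
Likewise, the "pte structure as an array of 16 entries" that Xu Lu
describes as the core modification can be sketched roughly as follows.
This is an assumption-laden illustration, not the actual patch code: the
type and helper names are made up, and only the Svnapot encoding itself
(N bit set, ppn[3:0] = 0b1000 for a 64 KiB NAPOT range) comes from the
RISC-V privileged specification.

/*
 * Rough sketch, not the actual patch: one 64K software PTE is backed by
 * 16 consecutive 4K hardware PTEs. When the backing physical range is
 * 64K-aligned, each hardware entry can be written in Svnapot (NAPOT)
 * form so the TLB may cache a single entry for the whole 64K range.
 */
#include <stdint.h>

#define HW_PTES_PER_SW_PTE	16		/* 64K / 4K		*/
#define PTE_NAPOT		(1ULL << 63)	/* Svnapot "N" bit	*/
#define PTE_PPN_SHIFT		10

/* A software pte_t covering 64K: an array of 16 hardware entries. */
typedef struct {
	uint64_t hw_pte[HW_PTES_PER_SW_PTE];
} sw_pte_t;

static void sw_pte_set_napot(sw_pte_t *pte, uint64_t pa_64k, uint64_t prot)
{
	uint64_t ppn = pa_64k >> 12;			/* base 4K frame number  */
	uint64_t napot_ppn = (ppn & ~0xfULL) | 0x8ULL;	/* 64K NAPOT: ppn[3:0]=0b1000 */
	int i;

	for (i = 0; i < HW_PTES_PER_SW_PTE; i++)
		pte->hw_pte[i] = (napot_ppn << PTE_PPN_SHIFT) | PTE_NAPOT | prot;
}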