On Mon, Feb 03, 2025 at 02:03:05PM +0100, Alexander Gordeev wrote:
> > So the question becomes whether to:
> > 
> > (a) Get rid of the conditional pte_mkdirty() entirely (this trial
> >     balloon)
> > (b) Put it in folio_mk_pte() for everybody, not just s390
> > (c) Put it in set_pte_range() as David suggested.
> > 
> > It's feeling like (c) is the best idea.
> 
> I will check option (c)

These call stacks end up in set_pte_range():

	mk_pte()
	set_pte_range()
	finish_fault()
	do_read_fault()
	do_pte_missing()
	__handle_mm_fault()
	handle_mm_fault()

	mk_pte()
	set_pte_range()
	finish_fault()
	do_shared_fault()
	do_pte_missing()
	__handle_mm_fault()
	handle_mm_fault()

	mk_pte()
	set_pte_range()
	filemap_map_pages()
	do_read_fault()
	do_pte_missing()
	__handle_mm_fault()
	handle_mm_fault()

Moving the PageDirty() check to generic code works, in the sense that
the page fault volume does not notably increase:

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 3ca5af4cfe43..b5d88f2b5214 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1454,8 +1454,6 @@ static inline pte_t mk_pte(struct page *page, pgprot_t pgprot)
 	unsigned long physpage = page_to_phys(page);
 	pte_t __pte = mk_pte_phys(physpage, pgprot);
 
-	if (pte_write(__pte) && PageDirty(page))
-		__pte = pte_mkdirty(__pte);
 	return __pte;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 539c0f7c6d54..4b04325db2ee 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5116,6 +5116,8 @@ void set_pte_range(struct vm_fault *vmf, struct folio *folio,
 
 	flush_icache_pages(vma, page, nr);
 	entry = mk_pte(page, vma->vm_page_prot);
+	if (pte_write(entry) && PageDirty(page))
+		entry = pte_mkdirty(entry);
 
 	if (prefault && arch_wants_old_prefaulted_pte())
 		entry = pte_mkold(entry);

The above is, however, not exactly the same, since set_pte_range() ->
set_ptes() dirties all PTEs in a folio, unlike the current s390
implementation, which dirties a single PTE based on its own struct page
flag. (A sketch of a folio-level check follows at the end of this mail.)

remove_migration_ptes() probably needs to be updated as well, unless we
are fine with the claim that a PTE is not always required to be
pre-dirtied. The same could apply to mk_pte() call paths I did not
manage to find, or to ones that get added in the future.

	mk_pte()
	remove_migration_ptes()
	migrate_pages_batch()
	migrate_pages_sync()
	migrate_pages()
	compact_zone()
	compact_node()
	kcompactd()
	kthread()
	__ret_from_fork()
	ret_from_fork()

Also, with the above change hugetlb PTEs are affected, since they are
created via mk_pte() but do not go through set_pte_range():

	mk_pte()
	mk_huge_pte()
	make_huge_pte()
	hugetlb_no_page()
	hugetlb_fault()
	handle_mm_fault()

Thus, we need to consider pre-dirtying of hugetlb PTEs as well, which I
think would be:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 65068671e460..1c890b3c9453 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5168,7 +5168,7 @@ static pte_t make_huge_pte(struct vm_area_struct *vma, struct page *page,
 	pte_t entry;
 	unsigned int shift = huge_page_shift(hstate_vma(vma));
 
-	if (try_mkwrite && (vma->vm_flags & VM_WRITE)) {
+	if ((vma->vm_flags & VM_WRITE) && (try_mkwrite || PageDirty(page))) {
 		entry = huge_pte_mkwrite(huge_pte_mkdirty(mk_huge_pte(page,
 					 vma->vm_page_prot)));
 	} else {

Thanks!
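
As an illustration of the folio-level point above, here is a minimal,
untested sketch, not a proposed patch: a hypothetical helper,
pte_mkdirty_if_folio_dirty(), that keys the pre-dirtying off the
folio-wide dirty flag via folio_test_dirty(), which matches the fact
that set_ptes() applies one entry to every PTE of the range anyway.
The helper name is made up for illustration only.

/*
 * Untested sketch, not a proposed patch: pre-dirty a freshly built PTE
 * based on the folio-wide dirty flag. The helper name is hypothetical.
 */
static inline pte_t pte_mkdirty_if_folio_dirty(pte_t entry, struct folio *folio)
{
	if (pte_write(entry) && folio_test_dirty(folio))
		entry = pte_mkdirty(entry);
	return entry;
}

In set_pte_range() this would take the place of the open-coded
PageDirty(page) check from the diff above, e.g.:

	entry = pte_mkdirty_if_folio_dirty(entry, folio);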