On 1/17/2023 10:46 PM, Matthew Wilcox wrote:
> On Tue, Jan 17, 2023 at 11:37:05AM +0100, David Hildenbrand wrote:
>> On 17.01.23 10:19, Yin, Fengwei wrote:
>>>
>>>
>>> On 1/14/2023 2:13 AM, Matthew Wilcox wrote:
>>>> On Sat, Jan 14, 2023 at 12:35:38AM +0800, Yin Fengwei wrote:
>>>>> The page fault number can be reduced by batched PTEs population.
>>>>> The batch size of PTEs population is not allowed to cross:
>>>>>   - page table boundaries
>>>>>   - vma range
>>>>>   - large folio size
>>>>>   - fault_around_bytes
>>>>
>>>> I find this patch very interesting. But is it really worth it? Most
>>>> file-backed page faults are resolved through the ->map_pages() path
>>>> which is almost always filemap_map_pages(), which does something
>>>> fairly similar to this already. Do you have any performance numbers?
>>>>
>>> I tried the will-it-scale page_fault3:
>>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault3.c
>>> with 96 processes on a test box with 48C/86T.
>>>
>>> The test result improved by about 3.75X, with 4.1X fewer page faults,
>>> with this patch.
>>>
>>> But it's a micro-benchmark which shows an extremely friendly case for
>>> this patch.
>>>
>>> I didn't observe a performance gain with other workloads. I suppose
>>> shared file write operations may not be that common? Thanks.
>>
>> One question I have after reading "which does something fairly similar to
>> this already", if both paths could be unified.
>
> I've been thinking about this already; not so much in terms of "unifying
> these two implementations" but rather "What's the right API for mapping
> the parts of a folio that fit into userspace in response to a fault".
> I haven't got quite as far as drafting code, but I'm thinking there should
> be an API where we pass in the vmf, a folio and some other information,
> and that function takes care of mapping however many pages from this
> folio that it can.
>
> And the reason to split it up like that is to batch as many page
> operations into this folio as possible.  eg filemap_map_pages() was
> written "to get it working", not "to be efficient", so it does stupid
> things like call folio_ref_inc() every time it maps a page instead of
> counting how many pages it maps and calling folio_ref_add() at the end.
> Similar optimisations should be done for mapcount, which implies some
> kind of batched equivalent of page_add_*_rmap().

I did the following experiment (only boot-tested in QEMU so far). I think
another gain is that we no longer folio_trylock()/folio_unlock() each
sub-page of a large folio; the whole folio only needs to be
trylocked/unlocked once.

Batching the mapcount may need more changes, e.g. checking in advance that
there is no !pte_none() entry in the range and that the entire large folio
will be mapped.
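
To make the batched mapcount idea a bit more concrete, here is a rough,
untested sketch of the shape it could take. do_set_pte_only() and
folio_add_file_rmap_range() are made-up helpers, not existing APIs (today
do_set_pte() also does the per-page rmap and mm counter updates):

static bool filemap_map_whole_folio(struct vm_fault *vmf, struct folio *folio,
		unsigned long addr)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned int i, nr = folio_nr_pages(folio);

	/*
	 * Assumes the caller already clamped the range to the vma, the
	 * page table and fault_around_bytes, and holds the PTE lock.
	 * Fall back to the existing per-page path unless every PTE in
	 * the range is still none and no sub-page is poisoned.
	 */
	for (i = 0; i < nr; i++) {
		if (!pte_none(vmf->pte[i]) ||
		    PageHWPoison(folio_page(folio, i)))
			return false;
	}

	for (i = 0; i < nr; i++) {
		/* Made-up do_set_pte() variant that only installs the PTE. */
		do_set_pte_only(vmf, folio_page(folio, i),
				addr + i * PAGE_SIZE);
		update_mmu_cache(vma, addr + i * PAGE_SIZE, vmf->pte + i);
	}

	/* One batched update instead of one per sub-page. */
	folio_ref_add(folio, nr);
	/* Made-up batched equivalent of page_add_file_rmap(). */
	folio_add_file_rmap_range(folio, 0, nr, vma);
	add_mm_counter(vma->vm_mm, mm_counter_file(&folio->page), nr);

	return true;
}

The PoC below is simpler: it keeps do_set_pte() per sub-page and only
batches the folio refcount.
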
diff --git a/mm/filemap.c b/mm/filemap.c
index c4d4ace9cc70..a79fd4766d2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3353,6 +3353,41 @@ static inline struct folio *next_map_page(struct address_space *mapping,
 				  mapping, xas, end_pgoff);
 }
 
+
+static vm_fault_t __filemap_map_folio(struct vm_fault *vmf,
+		struct folio *folio, unsigned long addr, pgoff_t idx, int count)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	int i = 0, nr_pages = folio_nr_pages(folio);
+	int len = ((count + idx) > nr_pages) ?
+			(nr_pages - idx) : count;
+	int ref_count = 0;
+	vm_fault_t ret = 0;
+
+	do {
+		struct page *page;
+
+		page = folio_page(folio, i + idx);
+		if (PageHWPoison(page))
+			continue;
+
+		if (!pte_none(*vmf->pte))
+			continue;
+
+		if (vmf->address == addr)
+			ret = VM_FAULT_NOPAGE;
+
+		ref_count++;
+
+		do_set_pte(vmf, page, addr);
+		update_mmu_cache(vma, addr, vmf->pte);
+	} while (vmf->pte++, addr += PAGE_SIZE, ++i < len);
+
+	vmf->pte -= len;
+	folio_ref_add(folio, ref_count);
+
+	return ret;
+}
+
 vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 			     pgoff_t start_pgoff, pgoff_t end_pgoff)
 {
@@ -3363,9 +3398,9 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	unsigned long addr;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
 	struct folio *folio;
-	struct page *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 	vm_fault_t ret = 0;
+	int idx;
 
 	rcu_read_lock();
 	folio = first_map_page(mapping, &xas, end_pgoff);
@@ -3380,45 +3415,21 @@ vm_fault_t filemap_map_pages(struct vm_fault *vmf,
 	addr = vma->vm_start + ((start_pgoff - vma->vm_pgoff) << PAGE_SHIFT);
 	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
 	do {
-again:
-		page = folio_file_page(folio, xas.xa_index);
-		if (PageHWPoison(page))
-			goto unlock;
-
-		if (mmap_miss > 0)
-			mmap_miss--;
+		int count;
 
+		idx = xas.xa_index & (folio_nr_pages(folio) - 1);
 		addr += (xas.xa_index - last_pgoff) << PAGE_SHIFT;
 		vmf->pte += xas.xa_index - last_pgoff;
 		last_pgoff = xas.xa_index;
 
-		/*
-		 * NOTE: If there're PTE markers, we'll leave them to be
-		 * handled in the specific fault path, and it'll prohibit the
-		 * fault-around logic.
-		 */
-		if (!pte_none(*vmf->pte))
-			goto unlock;
+		count = end_pgoff - last_pgoff + 1;
 
-		/* We're about to handle the fault */
-		if (vmf->address == addr)
+		if (VM_FAULT_NOPAGE ==
+			__filemap_map_folio(vmf, folio, addr, idx, count))
 			ret = VM_FAULT_NOPAGE;
 
-		do_set_pte(vmf, page, addr);
-		/* no need to invalidate: a not-present page won't be cached */
-		update_mmu_cache(vma, addr, vmf->pte);
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			folio_ref_inc(folio);
-			goto again;
-		}
-		folio_unlock(folio);
-		continue;
-unlock:
-		if (folio_more_pages(folio, xas.xa_index, end_pgoff)) {
-			xas.xa_index++;
-			goto again;
-		}
+		xas.xa_index += folio_nr_pages(folio) - idx;
+
 		folio_unlock(folio);
 		folio_put(folio);
 	} while ((folio = next_map_page(mapping, &xas, end_pgoff)) != NULL);

NOTE: PoC code doesn't cover file->f_ra.mmap_miss.


Thanks.

Regards
Yin, Fengwei