The patch titled
     Subject: huge tmpfs: map shmem by huge page pmd or by page team ptes
has been removed from the -mm tree.  Its filename was
     huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes.patch

This patch was dropped because an updated version will be merged

------------------------------------------------------
From: Hugh Dickins <hughd@xxxxxxxxxx>
Subject: huge tmpfs: map shmem by huge page pmd or by page team ptes

This is the commit which at last gets huge mappings of tmpfs working, as
can be seen from the ShmemPmdMapped line of /proc/meminfo.

The main thing here is the trio of functions map_team_by_pmd(),
unmap_team_by_pmd() and remap_team_by_ptes() added to huge_memory.c; and
of course the enablement of FAULT_FLAG_MAY_HUGE from memory.c to shmem.c,
with VM_FAULT_HUGE back from shmem.c to memory.c.  But one-line and
few-line changes scattered throughout huge_memory.c.

Huge tmpfs is relying on the pmd_trans_huge() page table hooks which the
original Anonymous THP project placed throughout mm; but skips almost all
of its complications, going to its own simpler handling.

Kirill has a much better idea of what copy_huge_pmd() should do for
pagecache: nothing, just as we don't copy shared file ptes.  I shall
adopt his idea in a future version, but for now show how to dup team.

Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: "Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx>
Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Cc: Andres Lagar-Cavilla <andreslc@xxxxxxxxxx>
Cc: Yang Shi <yang.shi@xxxxxxxxxx>
Cc: Ning Qu <quning@xxxxxxxxx>
Cc: David Rientjes <rientjes@xxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 Documentation/vm/transhuge.txt |   38 ++++-
 include/linux/pageteam.h       |   48 ++++++
 mm/huge_memory.c               |  229 +++++++++++++++++++++++++++++--
 mm/memory.c                    |   12 +
 4 files changed, 307 insertions(+), 20 deletions(-)

diff -puN Documentation/vm/transhuge.txt~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes Documentation/vm/transhuge.txt
--- a/Documentation/vm/transhuge.txt~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes
+++ a/Documentation/vm/transhuge.txt
@@ -9,8 +9,8 @@ using huge pages for the backing of virt
 that supports the automatic promotion and demotion of page sizes and
 without the shortcomings of hugetlbfs.
 
-Currently it only works for anonymous memory mappings but in the
-future it can expand over the pagecache layer starting with tmpfs.
+Initially it only worked for anonymous memory mappings, but then was
+extended to the pagecache layer, starting with tmpfs.
 
 The reason applications are running faster is because of two
 factors. The first factor is almost completely irrelevant and it's not
@@ -57,9 +57,8 @@ miss is going to run faster.
   feature that applies to all dynamic high order allocations in the
   kernel)
 
-- this initial support only offers the feature in the anonymous memory
-  regions but it'd be ideal to move it to tmpfs and the pagecache
-  later
+- initial support only offered the feature in anonymous memory regions,
+  but then it was extended to huge tmpfs pagecache: see section below.
 
 Transparent Hugepage Support maximizes the usefulness of free memory
 if compared to the reservation approach of hugetlbfs by allowing all
@@ -458,3 +457,32 @@ exit(2) if an THP crosses VMA boundary.
 Function deferred_split_huge_page() is used to queue page for splitting.
 The splitting itself will happen when we get memory pressure via shrinker
 interface.
+
+== Huge tmpfs ==
+
+Transparent hugepages were implemented much later in tmpfs.
+That implementation shares much of the "pmd" infrastructure
+devised for anonymous hugepages, and their reliance on compaction.
+
+But unlike hugetlbfs, which has always been free to impose its own
+restrictions, a transparent implementation of pagecache in tmpfs must
+be able to support files both large and small, with large extents
+mapped by hugepage pmds at the same time as small extents (of the
+very same pagecache) are mapped by ptes. For this reason, the
+compound pages used for hugetlbfs and anonymous hugepages were found
+unsuitable, and the opposite approach taken: the high-order backing
+page is split from the start, and managed as a team of partially
+independent small cache pages.
+
+Huge tmpfs is enabled simply by a "huge=1" mount option, and does not
+attend to the boot options, sysfs settings and madvice controlling
+anonymous hugepages. Huge tmpfs recovery (putting a hugepage back
+together after it was disbanded for reclaim, or after a period of
+fragmentation) is done by a workitem scheduled from fault, without
+involving khugepaged at all.
+
+For more info on huge tmpfs, see Documentation/filesystems/tmpfs.txt.
+It is an open question whether that implementation forms the basis for
+extending transparent hugepages to other filesystems' pagecache: in its
+present form, it makes use of struct page's private field, available on
+tmpfs, but already in use on most other filesystems.
diff -puN include/linux/pageteam.h~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes include/linux/pageteam.h
--- a/include/linux/pageteam.h~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes
+++ a/include/linux/pageteam.h
@@ -29,10 +29,56 @@ static inline struct page *team_head(str
 	return head;
 }
 
-/* Temporary stub for mm/rmap.c until implemented in mm/huge_memory.c */
+/*
+ * Returns true if this team is mapped by pmd somewhere.
+ */
+static inline bool team_pmd_mapped(struct page *head)
+{
+	return atomic_long_read(&head->team_usage) > HPAGE_PMD_NR;
+}
+
+/*
+ * Returns true if this was the first mapping by pmd, whereupon mapped stats
+ * need to be updated.
+ */
+static inline bool inc_team_pmd_mapped(struct page *head)
+{
+	return atomic_long_inc_return(&head->team_usage) == HPAGE_PMD_NR+1;
+}
+
+/*
+ * Returns true if this was the last mapping by pmd, whereupon mapped stats
+ * need to be updated.
+ */
+static inline bool dec_team_pmd_mapped(struct page *head)
+{
+	return atomic_long_dec_return(&head->team_usage) == HPAGE_PMD_NR;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int map_team_by_pmd(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t *pmd, struct page *page);
+void unmap_team_by_pmd(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t *pmd, struct page *page);
+void remap_team_by_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t *pmd);
+#else
+static inline int map_team_by_pmd(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t *pmd, struct page *page)
+{
+	VM_BUG_ON_PAGE(1, page);
+	return 0;
+}
 static inline void unmap_team_by_pmd(struct vm_area_struct *vma,
 		unsigned long addr, pmd_t *pmd, struct page *page)
 {
+	VM_BUG_ON_PAGE(1, page);
+}
+static inline void remap_team_by_ptes(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t *pmd)
+{
+	VM_BUG_ON(1);
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_PAGETEAM_H */
diff -puN mm/huge_memory.c~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes mm/huge_memory.c
--- a/mm/huge_memory.c~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes
+++ a/mm/huge_memory.c
@@ -25,6 +25,7 @@
 #include <linux/mman.h>
 #include <linux/memremap.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/debugfs.h>
 #include <linux/migrate.h>
 #include <linux/hashtable.h>
@@ -63,6 +64,8 @@ enum scan_result {
 #define CREATE_TRACE_POINTS
 #include <trace/events/huge_memory.h>
 
+static void page_remove_team_rmap(struct page *);
+
 /*
  * By default transparent hugepage support is disabled in order that avoid
  * to risk increase the memory footprint of applications without a guaranteed
@@ -1120,17 +1123,23 @@ int copy_huge_pmd(struct mm_struct *dst_
 	if (!vma_is_dax(vma)) {
 		/* thp accounting separate from pmd_devmap accounting */
 		src_page = pmd_page(pmd);
-		VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
 		get_page(src_page);
-		page_dup_rmap(src_page, true);
-		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+		if (PageAnon(src_page)) {
+			VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+			page_dup_rmap(src_page, true);
+			pmdp_set_wrprotect(src_mm, addr, src_pmd);
+			pmd = pmd_wrprotect(pmd);
+		} else {
+			VM_BUG_ON_PAGE(!PageTeam(src_page), src_page);
+			page_dup_rmap(src_page, false);
+			inc_team_pmd_mapped(src_page);
+		}
+		add_mm_counter(dst_mm, mm_counter(src_page), HPAGE_PMD_NR);
 		atomic_long_inc(&dst_mm->nr_ptes);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	}
 
-	pmdp_set_wrprotect(src_mm, addr, src_pmd);
-	pmd = pmd_mkold(pmd_wrprotect(pmd));
-	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+	set_pmd_at(dst_mm, addr, dst_pmd, pmd_mkold(pmd));
 
 	ret = 0;
 out_unlock:
@@ -1429,7 +1438,7 @@ struct page *follow_trans_huge_pmd(struc
 		goto out;
 
 	page = pmd_page(*pmd);
-	VM_BUG_ON_PAGE(!PageHead(page), page);
+	VM_BUG_ON_PAGE(!PageHead(page) && !PageTeam(page), page);
 	if (flags & FOLL_TOUCH)
 		touch_pmd(vma, addr, pmd);
 	if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
@@ -1454,7 +1463,7 @@ struct page *follow_trans_huge_pmd(struc
 		}
 	}
 	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
-	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	VM_BUG_ON_PAGE(!PageCompound(page) && !PageTeam(page), page);
 	if (flags & FOLL_GET)
 		get_page(page);
 
@@ -1692,10 +1701,12 @@ int zap_huge_pmd(struct mmu_gather *tlb,
 		tlb_remove_page(tlb, pmd_page(orig_pmd));
 	} else {
 		struct page *page = pmd_page(orig_pmd);
-		page_remove_rmap(page, true);
+		if (PageTeam(page))
+			page_remove_team_rmap(page);
+		page_remove_rmap(page, PageHead(page));
 		VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
-		add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
-		VM_BUG_ON_PAGE(!PageHead(page), page);
+		VM_BUG_ON_PAGE(!PageHead(page) && !PageTeam(page), page);
+		add_mm_counter(tlb->mm, mm_counter(page), -HPAGE_PMD_NR);
 		pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
 		atomic_long_dec(&tlb->mm->nr_ptes);
 		spin_unlock(ptl);
@@ -1739,7 +1750,7 @@ bool move_huge_pmd(struct vm_area_struct
 		VM_BUG_ON(!pmd_none(*new_pmd));
 
 		if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
-				vma_is_anonymous(vma)) {
+				!vma_is_dax(vma)) {
 			pgtable_t pgtable;
 			pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
 			pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -1789,7 +1800,6 @@ int change_huge_pmd(struct vm_area_struc
 			entry = pmd_mkwrite(entry);
 		ret = HPAGE_PMD_NR;
 		set_pmd_at(mm, addr, pmd, entry);
-		BUG_ON(!preserve_write && pmd_write(entry));
 	}
 	spin_unlock(ptl);
 }
@@ -2989,6 +2999,11 @@ void __split_huge_pmd(struct vm_area_str
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long haddr = address & HPAGE_PMD_MASK;
 
+	if (!vma_is_anonymous(vma) && !vma->vm_ops->pmd_fault) {
+		remap_team_by_ptes(vma, address, pmd);
+		return;
+	}
+
 	mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_trans_huge(*pmd)) {
@@ -3467,4 +3482,190 @@ static int __init split_huge_pages_debug
 	return 0;
 }
 late_initcall(split_huge_pages_debugfs);
-#endif
+#endif /* CONFIG_DEBUG_FS */
+
+/*
+ * huge pmd support for huge tmpfs
+ */
+
+static void page_add_team_rmap(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	if (inc_team_pmd_mapped(page))
+		__inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+}
+
+static void page_remove_team_rmap(struct page *page)
+{
+	VM_BUG_ON_PAGE(PageAnon(page), page);
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	if (dec_team_pmd_mapped(page))
+		__dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+}
+
+int map_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+		    pmd_t *pmd, struct page *page)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	spinlock_t *pml;
+	pmd_t pmdval;
+	int ret = VM_FAULT_NOPAGE;
+
+	/*
+	 * Another task may have mapped it in just ahead of us; but we
+	 * have the huge page locked, so others will wait on us now... or,
+	 * is there perhaps some way another might still map in a single pte?
+	 */
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	if (!pmd_none(*pmd))
+		goto raced2;
+
+	addr &= HPAGE_PMD_MASK;
+	pgtable = pte_alloc_one(mm, addr);
+	if (!pgtable) {
+		ret = VM_FAULT_OOM;
+		goto raced2;
+	}
+
+	pml = pmd_lock(mm, pmd);
+	if (!pmd_none(*pmd))
+		goto raced1;
+	pmdval = mk_pmd(page, vma->vm_page_prot);
+	pmdval = pmd_mkhuge(pmd_mkdirty(pmdval));
+	pgtable_trans_huge_deposit(mm, pmd, pgtable);
+	set_pmd_at(mm, addr, pmd, pmdval);
+	page_add_file_rmap(page);
+	page_add_team_rmap(page);
+	update_mmu_cache_pmd(vma, addr, pmd);
+	atomic_long_inc(&mm->nr_ptes);
+	spin_unlock(pml);
+
+	unlock_page(page);
+	add_mm_counter(mm, MM_SHMEMPAGES, HPAGE_PMD_NR);
+	return ret;
+raced1:
+	spin_unlock(pml);
+	pte_free(mm, pgtable);
+raced2:
+	unlock_page(page);
+	put_page(page);
+	return ret;
+}
+
+void unmap_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+		       pmd_t *pmd, struct page *page)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable = NULL;
+	unsigned long end;
+	spinlock_t *pml;
+
+	VM_BUG_ON_PAGE(!PageTeam(page), page);
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	/*
+	 * But even so there might be a racing zap_huge_pmd() or
+	 * remap_team_by_ptes() while the page_table_lock is dropped.
+	 */
+
+	addr &= HPAGE_PMD_MASK;
+	end = addr + HPAGE_PMD_SIZE;
+
+	mmu_notifier_invalidate_range_start(mm, addr, end);
+	pml = pmd_lock(mm, pmd);
+	if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+		pmdp_huge_clear_flush(vma, addr, pmd);
+		pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+		page_remove_team_rmap(page);
+		page_remove_rmap(page, false);
+		atomic_long_dec(&mm->nr_ptes);
+	}
+	spin_unlock(pml);
+	mmu_notifier_invalidate_range_end(mm, addr, end);
+
+	if (!pgtable)
+		return;
+
+	pte_free(mm, pgtable);
+	update_hiwater_rss(mm);
+	add_mm_counter(mm, MM_SHMEMPAGES, -HPAGE_PMD_NR);
+	put_page(page);
+}
+
+void remap_team_by_ptes(struct vm_area_struct *vma, unsigned long addr,
+			pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page *head;
+	struct page *page;
+	pgtable_t pgtable;
+	unsigned long end;
+	spinlock_t *pml;
+	spinlock_t *ptl;
+	pte_t *pte;
+	pmd_t _pmd;
+	pmd_t pmdval;
+	pte_t pteval;
+
+	addr &= HPAGE_PMD_MASK;
+	end = addr + HPAGE_PMD_SIZE;
+
+	mmu_notifier_invalidate_range_start(mm, addr, end);
+	pml = pmd_lock(mm, pmd);
+	if (!pmd_trans_huge(*pmd))
+		goto raced;
+
+	page = head = pmd_page(*pmd);
+	pmdval = pmdp_huge_clear_flush(vma, addr, pmd);
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+	ptl = pte_lockptr(mm, &_pmd);
+	if (ptl != pml)
+		spin_lock(ptl);
+	pmd_populate(mm, pmd, pgtable);
+	update_mmu_cache_pmd(vma, addr, pmd);
+
+	/*
+	 * It would be nice to have prepared this page table in advance,
+	 * so we could just switch from pmd to ptes under one lock.
+	 * But a comment in zap_huge_pmd() warns that ppc64 needs
+	 * to look at the deposited page table when clearing the pmd.
+	 */
+	pte = pte_offset_map(pmd, addr);
+	do {
+		pteval = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
+		if (!pmd_young(pmdval))
+			pteval = pte_mkold(pteval);
+		set_pte_at(mm, addr, pte, pteval);
+		VM_BUG_ON_PAGE(!PageTeam(page), page);
+		if (page != head) {
+			page_add_file_rmap(page);
+			get_page(page);
+		}
+		/*
+		 * Move page flags from head to page,
+		 * as __split_huge_page_tail() does for anon?
+		 * Start off by assuming not, but reconsider later.
+		 */
+	} while (pte++, page++, addr += PAGE_SIZE, addr != end);
+
+	/*
+	 * remap_team_by_ptes() is called from various locking contexts.
+	 * Don't dec_team_pmd_mapped() until after that page table has been
+	 * completed (with atomic_long_sub_return supplying a barrier):
+	 * otherwise shmem_disband_hugeteam() may disband it concurrently,
+	 * and pages be freed while mapped.
+	 */
+	page_remove_team_rmap(head);
+
+	pte -= HPAGE_PMD_NR;
+	addr -= HPAGE_PMD_NR;
+	if (ptl != pml)
+		spin_unlock(ptl);
+	pte_unmap(pte);
raced:
+	spin_unlock(pml);
+	mmu_notifier_invalidate_range_end(mm, addr, end);
+}
diff -puN mm/memory.c~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes mm/memory.c
--- a/mm/memory.c~huge-tmpfs-map-shmem-by-huge-page-pmd-or-by-page-team-ptes
+++ a/mm/memory.c
@@ -45,6 +45,7 @@
 #include <linux/swap.h>
 #include <linux/highmem.h>
 #include <linux/pagemap.h>
+#include <linux/pageteam.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
@@ -2848,11 +2849,21 @@ static int __do_fault(struct vm_area_str
 	vmf.gfp_mask = __get_fault_gfp_mask(vma);
 	vmf.cow_page = cow_page;
 
+	/*
+	 * Give huge pmd a chance before allocating pte or trying fault around.
+	 */
+	if (unlikely(pmd_none(*pmd)))
+		vmf.flags |= FAULT_FLAG_MAY_HUGE;
+
 	ret = vma->vm_ops->fault(vma, &vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
 		return ret;
 	if (!vmf.page)
 		goto out;
+	if (unlikely(ret & VM_FAULT_HUGE)) {
+		ret |= map_team_by_pmd(vma, address, pmd, vmf.page);
+		return ret;
+	}
 
 	if (unlikely(!(ret & VM_FAULT_LOCKED)))
 		lock_page(vmf.page);
@@ -3343,6 +3354,7 @@ static int wp_huge_pmd(struct mm_struct
 		return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
 	if (vma->vm_ops->pmd_fault)
 		return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+	remap_team_by_ptes(vma, address, pmd);
 	return VM_FAULT_FALLBACK;
 }
 
_

Patches currently in -mm which might be from hughd@xxxxxxxxxx are

huge-pagecache-mmap_sem-is-unlocked-when-truncation-splits-pmd.patch
mm-update_lru_size-warn-and-reset-bad-lru_size.patch
mm-update_lru_size-do-the-__mod_zone_page_state.patch
mm-use-__setpageswapbacked-and-dont-clearpageswapbacked.patch
tmpfs-preliminary-minor-tidyups.patch
mm-proc-sys-vm-stat_refresh-to-force-vmstat-update.patch
huge-mm-move_huge_pmd-does-not-need-new_vma.patch
huge-pagecache-extend-mremap-pmd-rmap-lockout-to-files.patch
arch-fix-has_transparent_hugepage.patch
huge-tmpfs-disband-split-huge-pmds-on-race-or-memory-failure.patch
huge-tmpfs-extend-get_user_pages_fast-to-shmem-pmd.patch
huge-tmpfs-use-unevictable-lru-with-variable-hpage_nr_pages.patch
huge-tmpfs-fix-mlocked-meminfo-track-huge-unhuge-mlocks.patch
huge-tmpfs-fix-mapped-meminfo-track-huge-unhuge-mappings.patch
huge-tmpfs-mem_cgroup-move-charge-on-shmem-huge-pages.patch
huge-tmpfs-proc-pid-smaps-show-shmemhugepages.patch
huge-tmpfs-recovery-framework-for-reconstituting-huge-pages.patch
huge-tmpfs-recovery-shmem_recovery_populate-to-fill-huge-page.patch
huge-tmpfs-recovery-shmem_recovery_remap-remap_team_by_pmd.patch
huge-tmpfs-recovery-shmem_recovery_swapin-to-read-from-swap.patch
huge-tmpfs-recovery-tweak-shmem_getpage_gfp-to-fill-team.patch
huge-tmpfs-recovery-debugfs-stats-to-complete-this-phase.patch
huge-tmpfs-recovery-page-migration-call-back-into-shmem.patch

--
To unsubscribe from this list: send the line "unsubscribe mm-commits" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html
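
[Not part of the patch above: the transhuge.txt text it adds says huge tmpfs is
enabled by a "huge=1" mount option, and the changelog points at the
ShmemPmdMapped line of /proc/meminfo as the way to see huge mappings taking
effect.  The following is a minimal userspace sketch of that workflow; it
assumes a kernel carrying this patchset, the mount point and file name are
placeholders, and whether the mapping really ends up backed by a huge pmd
still depends on alignment and on a huge page being available.]

/*
 * Illustrative sketch only: mount tmpfs with huge=1, map a pmd-sized
 * file MAP_SHARED, touch it, then report ShmemPmdMapped from
 * /proc/meminfo.  Needs root for mount(2); paths are placeholders.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/mount.h>
#include <unistd.h>

#define HUGE_SIZE	(2UL << 20)	/* one pmd-sized extent on x86_64 */

static void show_shmem_pmd_mapped(void)
{
	char line[128];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return;
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "ShmemPmdMapped:", 15))
			fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	char *p;
	int fd;

	/* "huge=1" is the mount option described in transhuge.txt above */
	if (mount("tmpfs", "/mnt", "tmpfs", 0, "huge=1"))
		perror("mount");	/* may already be mounted; carry on */

	fd = open("/mnt/testfile", O_RDWR | O_CREAT, 0600);
	if (fd < 0 || ftruncate(fd, HUGE_SIZE)) {
		perror("open/ftruncate");
		return 1;
	}

	p = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 0, HUGE_SIZE);	/* fault in: may be mapped by huge pmd */
	show_shmem_pmd_mapped();

	munmap(p, HUGE_SIZE);
	close(fd);
	return 0;
}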