On 08/03/23 12:48, Sidhartha Kumar wrote:
> v2 -> v3
>   - gather performance data per Mike Kravetz
>   - remove start variable in remove_inode_hugepages() per Mike Kravetz
>   - remove hugetlb special case within folio_file_page()
>
>
> ========================= PERFORMANCE ======================================
>
> Perf was used to check the performance differences after the patch. Overall
> the performance is similar to mainline, with a very small additional overhead
> that occurs in __filemap_add_folio() and hugetlb_add_to_page_cache(). This is
> because of the additional overhead that occurs in xa_load() and xa_store(),
> as the xarray now uses more entries to store hugetlb folios in the page cache.
>
> aarch64:
>     workload - fallocate a 700GB file backed by huge pages
>
>     6.5-rc3 + this patch:
>     2MB Page Size:
>         --100.00%--__arm64_sys_fallocate
>                    ksys_fallocate
>                    vfs_fallocate
>                    hugetlbfs_fallocate
>                    |
>                    |--95.04%--__pi_clear_page
>                    |
>                    |--3.57%--clear_huge_page
>                    |          |
>                    |          |--2.63%--rcu_all_qs
>                    |          |
>                    |           --0.91%--__cond_resched
>                    |
>                     --0.67%--__cond_resched
>
>     0.17%  0.00%   0  fallocate  [kernel.vmlinux]   [k] hugetlb_add_to_page_cache
>     0.14%  0.10%  11  fallocate  [kernel.vmlinux]   [k] __filemap_add_folio

Thanks for getting the performance data!

I think someone may have already mentioned that this should be part of the
actual commit message.  And, when moved to the actual commit message, you
might not want to include the data where there were no perf samples in the
page cache related code (1GB pages).

Any operation where we add a hugetlb page to the cache is going to be
immediately preceded by clearing the page (as in fallocate or a fault), or
writing to the page (as in userfaultfd).  Therefore, the difference in page
cache handling is going to be mostly in the noise.  This is more so with
larger huge page sizes such as 1G.

>     6.5-rc3:
>     2MB Page Size:
>         --100.00%--__arm64_sys_fallocate
>                    ksys_fallocate
>                    vfs_fallocate
>                    hugetlbfs_fallocate
>                    |
>                    |--94.91%--__pi_clear_page
>                    |
>                    |--4.11%--clear_huge_page
>                    |          |
>                    |          |--3.00%--rcu_all_qs
>                    |          |
>                    |           --1.10%--__cond_resched
>                    |
>                     --0.59%--__cond_resched
>
>     0.08%  0.01%  1  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
>     0.05%  0.03%  3  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
>
> x86:
>     workload - fallocate a 100GB file backed by huge pages
>
>     6.5-rc3 + this patch:
>     2MB Page Size:
>     0.04%  0.04%  1  fallocate  [kernel.kallsyms]  [k] xa_load
>     0.04%  0.00%  0  fallocate  [kernel.kallsyms]  [k] hugetlb_add_to_page_cache
>     0.04%  0.00%  0  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio
>     0.04%  0.00%  0  fallocate  [kernel.kallsyms]  [k] xas_store
>
>     6.5-rc3:
>     2MB Page Size:
>     0.03%  0.03%  1  fallocate  [kernel.kallsyms]  [k] __filemap_add_folio

What would be helpful is to include the real (user perceived) times with
and without your changes.  I expect these to be essentially the same.
What I want to avoid is introducing any user perceived slowdowns when doing
something like adding 1TB of hugetlb pages to the cache.  I am aware that
this may be done as part of an application (DB) startup.
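For reference, a minimal user-space sketch of how such a wall-clock
measurement could be taken.  The /mnt/huge mount point, file name, and
700GB size below are illustrative assumptions only, not taken from the
patch or this review; it assumes hugetlbfs is already mounted there with
enough huge pages reserved:

	/*
	 * Sketch only: time a hugetlbfs fallocate as a user would see it.
	 * Build with: gcc -O2 -o htime htime.c
	 */
	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <time.h>
	#include <unistd.h>

	int main(void)
	{
		const off_t size = 700ULL << 30;	/* assumed 700GB, as in the aarch64 run */
		struct timespec start, end;
		int fd;

		/* hypothetical hugetlbfs mount point and file name */
		fd = open("/mnt/huge/testfile", O_CREAT | O_RDWR, 0644);
		if (fd < 0) {
			perror("open");
			return EXIT_FAILURE;
		}

		clock_gettime(CLOCK_MONOTONIC, &start);
		if (fallocate(fd, 0, 0, size)) {
			perror("fallocate");
			close(fd);
			return EXIT_FAILURE;
		}
		clock_gettime(CLOCK_MONOTONIC, &end);

		printf("fallocate took %.3f s\n",
		       (end.tv_sec - start.tv_sec) +
		       (end.tv_nsec - start.tv_nsec) / 1e9);

		close(fd);
		return EXIT_SUCCESS;
	}

Running this once on a kernel with and once without the series would give
the user-perceived numbers asked for above.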
> rebased on mm-unstable 08/02/23
>
>
> [1]: https://lore.kernel.org/linux-mm/20230519220142.212051-1-sidhartha.kumar@xxxxxxxxxx/T/
> [2]: https://lore.kernel.org/lkml/20230609194947.37196-1-sidhartha.kumar@xxxxxxxxxx/
> [3]: https://lore.kernel.org/lkml/ZLtVlJA+V2+2yjxc@xxxxxxxxxxxxxxxxxxxx/T/
>
>  fs/hugetlbfs/inode.c    | 15 ++++++++-------
>  include/linux/hugetlb.h | 12 ++++++++++++
>  include/linux/pagemap.h | 29 ++---------------------------
>  mm/filemap.c            | 36 +++++++++++-------------------------
>  mm/hugetlb.c            | 11 ++++++-----
>  5 files changed, 39 insertions(+), 64 deletions(-)

There has been some code churn since the rebase, so you will need to rebase
again.  Besides that, the code looks good to me.
-- 
Mike Kravetz