Hi all, hugetlb init parallelization has now been updated to v6.

This version is tested on mm/mm-stable.

Since the release of v5, there have been some scattered discussions, which
primarily centered around two issues. Both issues have now been resolved,
leading to the release of v6.

Two updates in v6
-----------------

- Fix a Kconfig warning
  hugetlb parallelization depends on PADATA, and PADATA depends on SMP. When
  SMP is not selected, selecting PADATA causes a warning: "WARNING: unmet
  direct dependencies detected for PADATA". So HUGETLBFS can only select
  PADATA when SMP is set.

  padata.c is not compiled if !SMP, but padata_do_multithreaded is still used
  in this series for hugetlb parallel init. So it is necessary to implement a
  serial version in padata.h.

- Fix a potential bug in gather_bootmem_prealloc_node
  The current padata_do_multithreaded implementation guarantees that each
  gather_bootmem_prealloc_node task handles one node. However, the API
  described in the padata_do_multithreaded comment indicates that
  padata_do_multithreaded can also assign multiple nodes to a
  gather_bootmem_prealloc_node task.

  To avoid a potential bug from future changes in padata_do_multithreaded,
  gather_bootmem_prealloc_parallel is introduced to wrap
  gather_bootmem_prealloc_node.

  More details in: https://lore.kernel.org/r/20240213111347.3189206-3-gang.li@xxxxxxxxx

Introduction
------------

Hugetlb initialization during boot takes up a considerable amount of time.
For instance, on a 2TB system, initializing 1,800 1GB huge pages takes 1-2
seconds out of 10 seconds. Initializing 11,776 1GB pages on a 12TB Intel host
takes more than 1 minute[1]. This is a noteworthy figure.

Inspired by [2] and [3], hugetlb initialization can also be accelerated
through parallelization. The kernel already has infrastructure like
padata_do_multithreaded; this patch uses it to achieve effective results with
minimal modifications.

[1] https://lore.kernel.org/all/783f8bac-55b8-5b95-eb6a-11a583675000@xxxxxxxxxx/
[2] https://lore.kernel.org/all/20200527173608.2885243-1-daniel.m.jordan@xxxxxxxxxx/
[3] https://lore.kernel.org/all/20230906112605.2286994-1-usama.arif@xxxxxxxxxxxxx/
[4] https://lore.kernel.org/all/76becfc1-e609-e3e8-2966-4053143170b6@xxxxxxxxxx/

max_threads
-----------

This patch uses `padata_do_multithreaded` like this:

```
job.max_threads = num_node_state(N_MEMORY) * multiplier;
padata_do_multithreaded(&job);
```

To fully utilize the CPU, the number of parallel threads needs to be chosen
carefully. `max_threads = num_node_state(N_MEMORY)` does not fully utilize the
CPU, so we need to multiply it by a multiplier.

Tests below indicate that a multiplier of 2 significantly improves
performance, and although larger values also provide improvements, the gains
are marginal.

  multiplier     1       2       3       4       5
 ------------ ------- ------- ------- ------- -------
  256G 2node   358ms   215ms   157ms   134ms   126ms
  2T   4node   979ms   679ms   543ms   489ms   481ms
  50G  2node   71ms    44ms    37ms    30ms    31ms

Therefore, choosing 2 as the multiplier strikes a good balance between
enhancing parallel processing capabilities and maintaining efficient resource
management.
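
For readers who want to check the numbers, here is a minimal userspace sketch
of the arithmetic in the max_threads section. It is not kernel code and not
part of this series: the helper names `hugetlb_max_threads` and
`saved_percent` are made up for illustration, and `nr_mem_nodes` simply stands
in for `num_node_state(N_MEMORY)`.

```c
#include <assert.h>

/*
 * Hypothetical helper (illustration only): the thread count described
 * above, i.e. number of nodes with memory times the multiplier.
 */
static unsigned int hugetlb_max_threads(unsigned int nr_mem_nodes,
					unsigned int multiplier)
{
	return nr_mem_nodes * multiplier;
}

/*
 * Hypothetical helper (illustration only): percentage of time saved
 * when a run drops from before_ms to after_ms.
 */
static double saved_percent(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return (before_ms - after_ms) * 100.0 / before_ms;
}
```

With the table above, `hugetlb_max_threads(2, 2)` gives 4 threads for the
256G 2-node machine, and `saved_percent(358, 215)` is roughly 40%, which is
the gain from moving the multiplier from 1 to 2 on that machine.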

Test result
-----------

      test case        no patch(ms)   patched(ms)   saved
 ------------------- -------------- ------------- --------
  256c2T(4 node) 1G           4745          2024   57.34%
  128c1T(2 node) 1G           3358          1712   49.02%
     12T         1G          77000         18300   76.23%
  256c2T(4 node) 2M           3336          1051   68.52%
  128c1T(2 node) 2M           1943           716   63.15%

Change log
----------

Changes in v6:
- Fix a Kconfig warning
- Fix a potential bug in gather_bootmem_prealloc_node

Changes in v5:
- https://lore.kernel.org/lkml/20240126152411.1238072-1-gang.li@xxxxxxxxx/
- Use prep_and_add_allocated_folios in 2M hugetlb parallelization
- Update huge_boot_pages in arch/powerpc/mm/hugetlbpage.c
- Revise struct padata_mt_job comment
- Add 'max_threads' section in cover letter
- Collect more Reviewed-by

Changes in v4:
- https://lore.kernel.org/r/20240118123911.88833-1-gang.li@xxxxxxxxx
- Make padata_do_multithreaded dispatch all jobs with a global iterator
- Revise commit message
- Rename some functions
- Collect Tested-by and Reviewed-by

Changes in v3:
- https://lore.kernel.org/all/20240102131249.76622-1-gang.li@xxxxxxxxx/
- Select CONFIG_PADATA as we use padata_do_multithreaded
- Fix a race condition in h->next_nid_to_alloc
- Fix local variable initialization issues
- Remove RFC tag

Changes in v2:
- https://lore.kernel.org/all/20231208025240.4744-1-gang.li@xxxxxxxxx/
- Reduce complexity with `padata_do_multithreaded`
- Support 1G hugetlb

v1:
- https://lore.kernel.org/all/20231123133036.68540-1-gang.li@xxxxxxxxx/
- parallelize 2M hugetlb initialization with workqueue

Gang Li (8):
  hugetlb: code clean for hugetlb_hstate_alloc_pages
  hugetlb: split hugetlb_hstate_alloc_pages
  hugetlb: pass *next_nid_to_alloc directly to
    for_each_node_mask_to_alloc
  padata: dispatch works on different nodes
  padata: downgrade padata_do_multithreaded to serial execution for
    non-SMP
  hugetlb: have CONFIG_HUGETLBFS select CONFIG_PADATA
  hugetlb: parallelize 2M hugetlb allocation and initialization
  hugetlb: parallelize 1G hugetlb initialization

 arch/powerpc/mm/hugetlbpage.c |   2 +-
 fs/Kconfig                    |   1 +
 include/linux/hugetlb.h       |   2 +-
 include/linux/padata.h        |  14 +-
 kernel/padata.c               |  14 +-
 mm/hugetlb.c                  | 241 +++++++++++++++++++++++-----------
 mm/mm_init.c                  |   1 +
 7 files changed, 190 insertions(+), 85 deletions(-)

-- 
2.20.1