> On Jan 22, 2024, at 18:12, Gang Li <gang.li@xxxxxxxxx> wrote:
> 
> On 2024/1/22 15:10, Muchun Song wrote:
>> On 2024/1/18 20:39, Gang Li wrote:
>>> +static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
>>> {
>>> -    unsigned long i;
>>> +    struct hstate *h = (struct hstate *)arg;
>>> +    int i, num = end - start;
>>> +    nodemask_t node_alloc_noretry;
>>> +    unsigned long flags;
>>> +    int next_node = 0;
>>
>> This should be first_online_node, which may not be zero.
> 
> That's right. Thanks!
> 
>>> -    for (i = 0; i < h->max_huge_pages; ++i) {
>>> -        if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
>>> +    /* Bit mask controlling how hard we retry per-node allocations. */
>>> +    nodes_clear(node_alloc_noretry);
>>> +
>>> +    for (i = 0; i < num; ++i) {
>>> +        struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
>>> +                        &node_alloc_noretry, &next_node);
>>> +        if (!folio)
>>>             break;
>>> +        spin_lock_irqsave(&hugetlb_lock, flags);
>>
>> I suspect there will be more contention on this lock when parallelizing.
> 
> In the worst case, there are only as many threads in contention as there
> are NUMA nodes. And in my testing, it doesn't degrade performance, but
> rather improves performance due to the reduced granularity.

So, the performance does not change if you move the lock out of the loop?

> 
>> I want to know why you chose to drop the prep_and_add_allocated_folios()
>> call from the original hugetlb_pages_alloc_boot()?
> 
> I split it out so that hugetlb_vmemmap_optimize_folios() can be
> parallelized.

Unfortunately, HVO should be enabled before pages go to the pool list
(a rough sketch of what I mean is at the end of this mail).

> 
>>> +static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
>>> +{
>>> +    struct padata_mt_job job = {
>>> +        .fn_arg = h,
>>> +        .align = 1,
>>> +        .numa_aware = true
>>> +    };
>>> +
>>> +    job.thread_fn   = hugetlb_alloc_node;
>>> +    job.start       = 0;
>>> +    job.size        = h->max_huge_pages;
>>> +    job.min_chunk   = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
>>> +    job.max_threads = num_node_state(N_MEMORY) * 2;
>>
>> I am curious about the magic number 2 used in the assignments of
>> ->min_chunk and ->max_threads. Does it come from your experiments? I
>> think it should be explained in a comment here.
> 
> This is tested, and I can perform more detailed tests and provide data.
> 
>> And I am also sceptical about the optimization for a small number of
>> hugepage allocations. Given 4 hugepages need to be allocated on a UMA
>> system, job.min_chunk will be 2 and job.max_threads will be 2. Then 2
>> workers will be scheduled, but each worker will only allocate 2 pages.
>> How much is the cost of that scheduling? What if the 4 pages were
>> allocated in a single worker? Do you have any numbers comparing
>> parallelism and non-parallelism in a small allocation case? If we
>> cannot gain from this case, I think we should assign a reasonable
>> value to ->min_chunk based on experiment.
>>
>> Thanks.
>>
> 
> That's a good suggestion, I'll run some tests and choose the best
> values.
> 
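
To be concrete about the two points above (moving hugetlb_lock out of the
loop and keeping HVO before the folios become visible in the pool), the
sketch below is roughly what I have in mind for hugetlb_alloc_node(): each
worker collects its folios on a local list and hands them to
prep_and_add_allocated_folios() once. This is only an illustration on top
of your patch, untested, and it assumes prep_and_add_allocated_folios()
keeps its current behavior of running the bulk HVO pass on the list first
and then taking hugetlb_lock a single time while enqueueing:

static void __init hugetlb_alloc_node(unsigned long start, unsigned long end, void *arg)
{
        struct hstate *h = (struct hstate *)arg;
        int i, num = end - start;
        nodemask_t node_alloc_noretry;
        int next_node = first_online_node;
        LIST_HEAD(folio_list);

        /* Bit mask controlling how hard we retry per-node allocations. */
        nodes_clear(node_alloc_noretry);

        for (i = 0; i < num; ++i) {
                struct folio *folio = alloc_pool_huge_folio(h, &node_states[N_MEMORY],
                                                &node_alloc_noretry, &next_node);
                if (!folio)
                        break;
                /* No hugetlb_lock here; just collect on a per-worker list. */
                list_add(&folio->lru, &folio_list);
                cond_resched();
        }

        /*
         * Bulk HVO on the local list first, then one hugetlb_lock cycle to
         * add everything to the pool, instead of a lock/unlock per folio.
         */
        prep_and_add_allocated_folios(h, &folio_list);
}

With something like this, the per-folio spin_lock_irqsave() in the loop
goes away entirely and the HVO ordering question does not arise.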
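
And for the ->min_chunk question, one option is to give it a floor so that
small requests stay in a single worker. The sketch below only shows the
shape; the 256 is a hypothetical placeholder, not a measured value, and
would need to come out of your experiments:

static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
{
        struct padata_mt_job job = {
                .fn_arg         = h,
                .align          = 1,
                .numa_aware     = true
        };

        job.thread_fn   = hugetlb_alloc_node;
        job.start       = 0;
        job.size        = h->max_huge_pages;

        /*
         * Placeholder numbers: split the work into up to two chunks per
         * memory node (matching ->max_threads below), but never below a
         * floor (256 here is hypothetical, not measured) so that small
         * requests are served by a single worker instead of paying the
         * scheduling cost of extra threads.
         */
        job.min_chunk   = max(h->max_huge_pages / num_node_state(N_MEMORY) / 2,
                              256UL);
        job.max_threads = num_node_state(N_MEMORY) * 2;

        padata_do_multithreaded(&job);

        return h->nr_huge_pages;
}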