> On Feb 5, 2024, at 16:26, Gang Li <gang.li@xxxxxxxxx> wrote:
>
> On 2024/2/5 15:28, Muchun Song wrote:
>> On 2024/1/26 23:24, Gang Li wrote:
>>> @@ -3390,8 +3390,6 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>>  	/* Send list for bulk vmemmap optimization processing */
>>>  	hugetlb_vmemmap_optimize_folios(h, folio_list);
>>> -	/* Add all new pool pages to free lists in one lock cycle */
>>> -	spin_lock_irqsave(&hugetlb_lock, flags);
>>>  	list_for_each_entry_safe(folio, tmp_f, folio_list, lru) {
>>>  		if (!folio_test_hugetlb_vmemmap_optimized(folio)) {
>>>  			/*
>>> @@ -3404,23 +3402,27 @@ static void __init prep_and_add_bootmem_folios(struct hstate *h,
>>>  					HUGETLB_VMEMMAP_RESERVE_PAGES,
>>>  					pages_per_huge_page(h));
>>>  		}
>>> +		/* Subdivide locks to achieve better parallel performance */
>>> +		spin_lock_irqsave(&hugetlb_lock, flags);
>>>  		__prep_account_new_huge_page(h, folio_nid(folio));
>>>  		enqueue_hugetlb_folio(h, folio);
>>> +		spin_unlock_irqrestore(&hugetlb_lock, flags);
>>>  	}
>>> -	spin_unlock_irqrestore(&hugetlb_lock, flags);
>>>  }
>>>  /*
>>>   * Put bootmem huge pages into the standard lists after mem_map is up.
>>>   * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
>>>   */
>>> -static void __init gather_bootmem_prealloc(void)
>>> +static void __init gather_bootmem_prealloc_node(unsigned long start, unsigned long end, void *arg)
>>> +
>>>  {
>>> +	int nid = start;
>>
>> Sorry to be so late in noticing an issue here. I have seen a comment in
>> PADATA which says:
>>
>>     @max_threads: Max threads to use for the job, actual number may be less
>>                   depending on task size and minimum chunk size.
>>
>> PADATA will not guarantee gather_bootmem_prealloc_node() will be called
>> ->max_threads times (you have initialized it to the number of NUMA nodes
>> in gather_bootmem_prealloc). Therefore, we should add a loop here to
>> initialize multiple nodes, namely (@end - @start) of them. Otherwise, we
>> will miss initializing some nodes.
>>
>> Thanks.
>>
> In padata_do_multithreaded:
>
> ```
> /* Ensure at least one thread when size < min_chunk. */
> nworks = max(job->size / max(job->min_chunk, job->align), 1ul);
> nworks = min(nworks, job->max_threads);
>
> ps.nworks = padata_work_alloc_mt(nworks, &ps, &works);
> ```
>
> So we have nworks <= max_threads, but >= size/min_chunk.

Given a 4-node system, the current implementation will schedule 4 threads,
each calling gather_bootmem_prealloc_node() for one node, and there is no
problem here. But what if PADATA schedules 2 threads and each thread has to
handle 2 nodes? I think that is possible for PADATA in the future, because
it does not break any semantics exposed to users. The comment about
@min_chunk says:

    The minimum chunk size in job-specific units. This allows the client to
    communicate the minimum amount of work that's appropriate for one worker
    thread to do at once.

It only defines the minimum chunk size, not a maximum, so it is possible to
let each ->thread_fn handle multiple minimum-sized chunks. Right? Therefore,
I am not concerned about the current implementation of PADATA, but about a
future one. Maybe a separate patch is acceptable, since it would be an
improvement rather than a fix (at least there is no bug currently).

Thanks.
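For what it's worth, the loop I have in mind would be something like the
sketch below (untested, only to illustrate the idea; the wrapper name and
the per-node helper gather_bootmem_prealloc_node() are placeholders for the
existing per-node logic):

```
/*
 * Sketch only (untested): let the PADATA thread function walk every node
 * in the [start, end) range it is handed, instead of assuming exactly one
 * node per invocation. The per-node body is the existing logic, factored
 * into a hypothetical helper here.
 */
static void __init gather_bootmem_prealloc_parallel(unsigned long start,
						    unsigned long end,
						    void *arg)
{
	unsigned long nid;

	for (nid = start; nid < end; nid++)
		gather_bootmem_prealloc_node(nid);
}
```

This way the code keeps working even if PADATA later decides to hand a
thread more than one minimum chunk.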