Re: [PATCH v4 6/7] hugetlb: parallelize 2M hugetlb allocation and initialization

Gang Li <gang.li@xxxxxxxxx> · Mon, 22 Jan 2024 18:12:53 +0800

On 2024/1/22 15:10, Muchun Song wrote:> On 2024/1/18 20:39, Gang Li wrote:
+static void __init hugetlb_alloc_node(unsigned long start, unsigned 
long end, void *arg)
  {
-    unsigned long i;
+    struct hstate *h = (struct hstate *)arg;
+    int i, num = end - start;
+    nodemask_t node_alloc_noretry;
+    unsigned long flags;
+    int next_node = 0;

This should be first_online_node which may be not zero.

That's right. Thanks!

-    for (i = 0; i < h->max_huge_pages; ++i) {
-        if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
+    /* Bit mask controlling how hard we retry per-node allocations.*/
+    nodes_clear(node_alloc_noretry);
+
+    for (i = 0; i < num; ++i) {
+        struct folio *folio = alloc_pool_huge_folio(h, 
&node_states[N_MEMORY],
+                        &node_alloc_noretry, &next_node);
+        if (!folio)
              break;
+        spin_lock_irqsave(&hugetlb_lock, flags);
 > I suspect there will more contention on this lock when parallelizing.

In the worst case, there are only 'numa node number' of threads in
contention. And in my testing, it doesn't degrade performance, but
rather improves performance due to the reduced granularity.

I want to know why you chose to drop prep_and_add_allocated_folios()
call in the original hugetlb_pages_alloc_boot()?

Splitting him to parallelize hugetlb_vmemmap_optimize_folios.

+static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
+{
+    struct padata_mt_job job = {
+        .fn_arg        = h,
+        .align        = 1,
+        .numa_aware    = true
+    };
+
+    job.thread_fn    = hugetlb_alloc_node;
+    job.start    = 0;
+    job.size    = h->max_huge_pages;
+    job.min_chunk    = h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+    job.max_threads    = num_node_state(N_MEMORY) * 2;

I am curious the magic number of 2 used in assignments of ->min_chunk
and ->max_threads, does it from your experiment? I thinke it should
be a comment here.

This is tested and I can perform more detailed tests and provide data.

And I am also sceptical about the optimization for a small amount of
allocation of hugepages. Given 4 hugepags needed to be allocated on UMA
system, job.min_chunk will be 2, job.max_threads will be 2. Then, 2
workers will be scheduled, however each worker will just allocate 2 pages,
how much the cost of scheduling? What if allocate 4 pages in single
worker? Do you have any numbers on parallelism vs non-parallelism in
a small allocation case? If we cannot gain from this case, I think we shold
assign a reasonable value to ->min_chunk based on experiment.

Thanks.

That's a good suggestion, I'll run some tests and choose the best
values.