Re: [PATCH v3 04/11] mm: vmalloc: Remove global vmap_area_root rb-tree

Wen Gu <guwen@xxxxxxxxxxxxxxxxx> · Fri, 5 Jan 2024 16:10:32 +0800

On 2024/01/03 02:46, Uladzislau Rezki wrote:

Store allocated objects in a separate nodes. A va->va_start
address is converted into a correct node where it should
be placed and resided. An addr_to_node() function is used
to do a proper address conversion to determine a node that
contains a VA.

Such approach balances VAs across nodes as a result an access
becomes scalable. Number of nodes in a system depends on number
of CPUs.

Please note:

1. As of now allocated VAs are bound to a node-0. It means the
   patch does not give any difference comparing with a current
   behavior;

2. The global vmap_area_lock, vmap_area_root are removed as there
   is no need in it anymore. The vmap_area_list is still kept and
   is _empty_. It is exported for a kexec only;

3. The vmallocinfo and vread() have to be reworked to be able to
   handle multiple nodes.

Reviewed-by: Baoquan He <bhe@xxxxxxxxxx>
Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx>
---

<...>

 struct vmap_area *find_vmap_area(unsigned long addr)
 {
+	struct vmap_node *vn;
 	struct vmap_area *va;
+	int i, j;

-	spin_lock(&vmap_area_lock);
-	va = __find_vmap_area(addr, &vmap_area_root);
-	spin_unlock(&vmap_area_lock);
+	/*
+	 * An addr_to_node_id(addr) converts an address to a node index
+	 * where a VA is located. If VA spans several zones and passed
+	 * addr is not the same as va->va_start, what is not common, we
+	 * may need to scan an extra nodes. See an example:
+	 *
+	 *      <--va-->
+	 * -|-----|-----|-----|-----|-
+	 *     1     2     0     1
+	 *
+	 * VA resides in node 1 whereas it spans 1 and 2. If passed
+	 * addr is within a second node we should do extra work. We
+	 * should mention that it is rare and is a corner case from
+	 * the other hand it has to be covered.
+	 */
+	i = j = addr_to_node_id(addr);
+	do {
+		vn = &vmap_nodes[i];

-	return va;
+		spin_lock(&vn->busy.lock);
+		va = __find_vmap_area(addr, &vn->busy.root);
+		spin_unlock(&vn->busy.lock);
+
+		if (va)
+			return va;
+	} while ((i = (i + 1) % nr_vmap_nodes) != j);
+
+	return NULL;
 }

Hi Uladzislau Rezki,

I really like your work, it is great and helpful!

Currently, I am working on using shared memory communication (SMC [1])
to transparently accelerate TCP communication between two peers within
the same OS instance[2].

In this scenario, a vzalloced kernel buffer acts as a shared memory and
will be simultaneous read or written by two SMC sockets, thus forming an
SMC connection.

          socket1                    socket2
             |                          ^
             |                          |    userspace
      ---- write -------------------- read ------
             |    +-----------------+   |    kernel
             +--->|  shared memory  |---+
                  | (vzalloced now) |
                  +-----------------+

Then I encountered the performance regression caused by lock contention
in find_vmap_area() when multiple threads transfer data through multiple
SMC connections on machines with many CPUs[3].

According to perf, the performance bottleneck is caused by the global
vmap_area_lock contention[4]:

- writer:

  smc_tx_sendmsg
  -> memcpy_from_msg
     -> copy_from_iter
        -> check_copy_size
           -> check_object_size
              -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
                 -> if(vm) find_vmap_area
                    -> try to hold vmap_area_lock lock
- reader:

  smc_rx_recvmsg
   -> memcpy_to_msg
      -> copy_to_iter
         -> check_copy_size
            -> check_object_size
               -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object
                  -> if(vm) find_vmap_area
                     -> try to hold vmap_area_lock lock

Fortunately, thank you for this patch set, the global vmap_area_lock was
removed and per node lock vn->busy.lock is introduced. it is really helpful:

In 48 CPUs qemu environment, the Requests/s increased by 5 times:
- nginx
- wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80

                vzalloced shmem      vzalloced shmem(with this patch set)
Requests/sec          113536.56            583729.93

But it also has some overhead, compared to using kzalloced shared memory
or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area:

                kzalloced shmem      vzalloced shmem(unset CONFIG_HARDENED_USERCOPY)
Requests/sec          831950.39            805164.78

So, as a newbie in Linux-mm, I would like to ask for some suggestions:

Is it possible to further eliminate the overhead caused by lock contention
in find_vmap_area() in this scenario (maybe this is asking too much), or the
only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced
buffer in the situation where cocurrent kernel-userspace-copy happens?

Any feedback will be appreciated. Thanks again for your time.

[1] Shared Memory Communications (SMC) enables two SMC capable peers to
    communicate by using memory buffers that each peer allocates for the
    partner's use. It improves throughput, lowers latency and cost, and
    maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315

[2] https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@xxxxxxxxxxxxxxxxx/

[3] issues: https://lore.kernel.org/all/1fbd6b74-1080-923a-01c1-689c3d65f880@xxxxxxxxxx/
    analysis: https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@xxxxxxxxxxxxxxxxx/

[4] Some flamegraphs are attached,
    - SMC using vzalloced buffer: vzalloc_t96.svg
    - SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg
    - SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg
    - SMC using kzalloced buffer: kzalloc_t96.svg

Best regards,
Wen Gu
Attachment:
vzalloc_t96.svg

Description: image/svg
Attachment:
vzalloc_t96_improve.svg

Description: image/svg
Attachment:
vzalloc_t96_nocheck.svg

Description: image/svg
Attachment:
kzalloc_t96.svg

Description: image/svg