On 2024/01/03 02:46, Uladzislau Rezki wrote:
Store allocated objects in a separate nodes. A va->va_start address is converted into a correct node where it should be placed and resided. An addr_to_node() function is used to do a proper address conversion to determine a node that contains a VA. Such approach balances VAs across nodes as a result an access becomes scalable. Number of nodes in a system depends on number of CPUs. Please note: 1. As of now allocated VAs are bound to a node-0. It means the patch does not give any difference comparing with a current behavior; 2. The global vmap_area_lock, vmap_area_root are removed as there is no need in it anymore. The vmap_area_list is still kept and is _empty_. It is exported for a kexec only; 3. The vmallocinfo and vread() have to be reworked to be able to handle multiple nodes. Reviewed-by: Baoquan He <bhe@xxxxxxxxxx> Signed-off-by: Uladzislau Rezki (Sony) <urezki@xxxxxxxxx> ---
<...>
struct vmap_area *find_vmap_area(unsigned long addr) { + struct vmap_node *vn; struct vmap_area *va; + int i, j;- spin_lock(&vmap_area_lock);- va = __find_vmap_area(addr, &vmap_area_root); - spin_unlock(&vmap_area_lock); + /* + * An addr_to_node_id(addr) converts an address to a node index + * where a VA is located. If VA spans several zones and passed + * addr is not the same as va->va_start, what is not common, we + * may need to scan an extra nodes. See an example: + * + * <--va--> + * -|-----|-----|-----|-----|- + * 1 2 0 1 + * + * VA resides in node 1 whereas it spans 1 and 2. If passed + * addr is within a second node we should do extra work. We + * should mention that it is rare and is a corner case from + * the other hand it has to be covered. + */ + i = j = addr_to_node_id(addr); + do { + vn = &vmap_nodes[i];- return va;+ spin_lock(&vn->busy.lock); + va = __find_vmap_area(addr, &vn->busy.root); + spin_unlock(&vn->busy.lock); + + if (va) + return va; + } while ((i = (i + 1) % nr_vmap_nodes) != j); + + return NULL; }
Hi Uladzislau Rezki, I really like your work, it is great and helpful! Currently, I am working on using shared memory communication (SMC [1]) to transparently accelerate TCP communication between two peers within the same OS instance[2]. In this scenario, a vzalloced kernel buffer acts as a shared memory and will be simultaneous read or written by two SMC sockets, thus forming an SMC connection. socket1 socket2 | ^ | | userspace ---- write -------------------- read ------ | +-----------------+ | kernel +--->| shared memory |---+ | (vzalloced now) | +-----------------+ Then I encountered the performance regression caused by lock contention in find_vmap_area() when multiple threads transfer data through multiple SMC connections on machines with many CPUs[3]. According to perf, the performance bottleneck is caused by the global vmap_area_lock contention[4]: - writer: smc_tx_sendmsg -> memcpy_from_msg -> copy_from_iter -> check_copy_size -> check_object_size -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object -> if(vm) find_vmap_area -> try to hold vmap_area_lock lock - reader: smc_rx_recvmsg -> memcpy_to_msg -> copy_to_iter -> check_copy_size -> check_object_size -> if (CONFIG_HARDENED_USERCOPY is set) check_heap_object -> if(vm) find_vmap_area -> try to hold vmap_area_lock lock Fortunately, thank you for this patch set, the global vmap_area_lock was removed and per node lock vn->busy.lock is introduced. it is really helpful: In 48 CPUs qemu environment, the Requests/s increased by 5 times: - nginx - wrk -c 1000 -t 96 -d 30 http://127.0.0.1:80 vzalloced shmem vzalloced shmem(with this patch set) Requests/sec 113536.56 583729.93 But it also has some overhead, compared to using kzalloced shared memory or unsetting CONFIG_HARDENED_USERCOPY, which won't involve finding vmap area: kzalloced shmem vzalloced shmem(unset CONFIG_HARDENED_USERCOPY) Requests/sec 831950.39 805164.78 So, as a newbie in Linux-mm, I would like to ask for some suggestions: Is it possible to further eliminate the overhead caused by lock contention in find_vmap_area() in this scenario (maybe this is asking too much), or the only way out is not setting CONFIG_HARDENED_USERCOPY or not using vzalloced buffer in the situation where cocurrent kernel-userspace-copy happens? Any feedback will be appreciated. Thanks again for your time. [1] Shared Memory Communications (SMC) enables two SMC capable peers to communicate by using memory buffers that each peer allocates for the partner's use. It improves throughput, lowers latency and cost, and maintains existing functions. See details in https://www.ibm.com/support/pages/node/7009315 [2] https://lore.kernel.org/netdev/1702214654-32069-1-git-send-email-guwen@xxxxxxxxxxxxxxxxx/ [3] issues: https://lore.kernel.org/all/1fbd6b74-1080-923a-01c1-689c3d65f880@xxxxxxxxxx/ analysis: https://lore.kernel.org/all/3189e342-c38f-6076-b730-19a6efd732a5@xxxxxxxxxxxxxxxxx/ [4] Some flamegraphs are attached, - SMC using vzalloced buffer: vzalloc_t96.svg - SMC using vzalloced buffer and with this patchset: vzalloc_t96_improve.svg - SMC using vzalloced buffer and unset CONFIG_HARDENED_USERCOPY: vzalloc_t96_nocheck.svg - SMC using kzalloced buffer: kzalloc_t96.svg Best regards, Wen Gu
Attachment:
vzalloc_t96.svg
Description: image/svg
Attachment:
vzalloc_t96_improve.svg
Description: image/svg
Attachment:
vzalloc_t96_nocheck.svg
Description: image/svg
Attachment:
kzalloc_t96.svg
Description: image/svg