From: Scott Cheloha <cheloha@xxxxxxxxxxxxxxxxxx>
Subject: drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup

Searching for a particular memory block by id is an O(n) operation because
each memory block's underlying device is kept in an unsorted linked list
on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache each memory block in an
xarray.  This time complexity improvement is significant on systems with
many memory blocks.  For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks.  With this
   change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
   completes ~12ms faster.

Before:
[    0.005042] memory_dev_init: adding memory blocks
[    0.021591] memory_dev_init: added memory blocks
[    0.022699] walk_memory_blocks: walking memory blocks
[    0.038730] walk_memory_blocks: walked memory blocks 0-511

After:
[    0.005057] memory_dev_init: adding memory blocks
[    0.009415] memory_dev_init: added memory blocks
[    0.010519] walk_memory_blocks: walking memory blocks
[    0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks.  With this
   change memory_dev_init() completes ~88ms faster and walk_memory_blocks()
   completes ~87ms faster.

Before:
[    0.252246] memory_dev_init: adding memory blocks
[    0.395469] memory_dev_init: added memory blocks
[    0.409413] walk_memory_blocks: walking memory blocks
[    0.433028] walk_memory_blocks: walked memory blocks 0-511
[    0.433094] walk_memory_blocks: walking memory blocks
[    0.500244] walk_memory_blocks: walked memory blocks 131072-131583

After:
[    0.245063] memory_dev_init: adding memory blocks
[    0.299539] memory_dev_init: added memory blocks
[    0.313609] walk_memory_blocks: walking memory blocks
[    0.315287] walk_memory_blocks: walked memory blocks 0-511
[    0.315349] walk_memory_blocks: walking memory blocks
[    0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks.
   With this change we complete memory_dev_init() ~37 minutes faster and
   walk_memory_blocks() at least ~30 minutes faster.  The exact timing for
   walk_memory_blocks() is missing, though I observed that the soft lockups
   in walk_memory_blocks() disappeared with the change, suggesting that
   lower bound.

Before:
[   13.703907] memory_dev_init: adding blocks
[ 2287.406099] memory_dev_init: added all blocks
[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

After:
[   13.696898] memory_dev_init: adding blocks
[   15.660035] memory_dev_init: added all blocks
(the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few
memory blocks.  A sparse xarray has a small footprint, and an O(log n)
lookup is slower than an O(n) lookup only for the smallest numbers of
memory blocks, and then only negligibly.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks.  With this
   change memory_dev_init() completes ~300us faster and walk_memory_blocks()
   completes no faster or slower.  The improvement is pretty close to noise.

Before:
[    0.224752] memory_dev_init: adding memory blocks
[    0.227116] memory_dev_init: added memory blocks
[    0.227183] walk_memory_blocks: walking memory blocks
[    0.227183] walk_memory_blocks: walked memory blocks 0-131

After:
[    0.224911] memory_dev_init: adding memory blocks
[    0.226935] memory_dev_init: added memory blocks
[    0.227089] walk_memory_blocks: walking memory blocks
[    0.227089] walk_memory_blocks: walked memory blocks 0-131

[david@xxxxxxxxxx: document the locking]
  Link: http://lkml.kernel.org/r/bc21eec6-7251-4c91-2f57-9a0671f8d414@xxxxxxxxxx
Link: http://lkml.kernel.org/r/20200121231028.13699-1-cheloha@xxxxxxxxxxxxx
Signed-off-by: Scott Cheloha <cheloha@xxxxxxxxxxxxx>
Acked-by: David Hildenbrand <david@xxxxxxxxxx>
Acked-by: Nathan Lynch <nathanl@xxxxxxxxxxxxx>
Acked-by: Michal Hocko <mhocko@xxxxxxxx>
Cc: Rafael J. Wysocki <rafael@xxxxxxxxxx>
Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
Cc: Rick Lindsley <ricklind@xxxxxxxxxxxxxxxxxx>
Cc: Scott Cheloha <cheloha@xxxxxxxxxxxxx>
Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
---

 drivers/base/memory.c |   44 +++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 12 deletions(-)

--- a/drivers/base/memory.c~drivers-base-memoryc-cache-memory-blocks-in-xarray-to-accelerate-lookup
+++ a/drivers/base/memory.c
@@ -21,6 +21,7 @@
 #include <linux/mm.h>
 #include <linux/stat.h>
 #include <linux/slab.h>
+#include <linux/xarray.h>
 
 #include <linux/atomic.h>
 #include <linux/uaccess.h>
@@ -74,6 +75,13 @@ static struct bus_type memory_subsys = {
 	.offline = memory_subsys_offline,
 };
 
+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static DEFINE_XARRAY(memory_blocks);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -489,22 +497,23 @@ int __weak arch_get_memory_phys_device(u
 	return 0;
 }
 
-/* A reference for the returned memory block device is acquired. */
+/*
+ * A reference for the returned memory block device is acquired.
+ *
+ * Called under device_hotplug_lock.
+ */
 static struct memory_block *find_memory_block_by_id(unsigned long block_id)
 {
-	struct device *dev;
+	struct memory_block *mem;
 
-	dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
-	return dev ? to_memory_block(dev) : NULL;
+	mem = xa_load(&memory_blocks, block_id);
+	if (mem)
+		get_device(&mem->dev);
+	return mem;
 }
 
 /*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
+ * Called under device_hotplug_lock.
  */
 struct memory_block *find_memory_block(struct mem_section *section)
 {
@@ -548,9 +557,16 @@ int register_memory(struct memory_block
 	memory->dev.offline = memory->state == MEM_OFFLINE;
 
 	ret = device_register(&memory->dev);
-	if (ret)
+	if (ret) {
 		put_device(&memory->dev);
-
+		return ret;
+	}
+	ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
+			      GFP_KERNEL));
+	if (ret) {
+		put_device(&memory->dev);
+		device_unregister(&memory->dev);
+	}
 	return ret;
 }
 
@@ -604,6 +620,8 @@ static void unregister_memory(struct mem
 	if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
 		return;
 
+	WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
@@ -750,6 +768,8 @@ void __init memory_dev_init(void)
  *
  * In case func() returns an error, walking is aborted and the error is
  * returned.
+ *
+ * Called under device_hotplug_lock.
  */
 int walk_memory_blocks(unsigned long start, unsigned long size,
 		       void *arg, walk_memory_blocks_func_t func)
_
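
Illustration only, not part of the patch: a minimal userspace sketch of
the lookup-cost argument in the changelog.  It contrasts a walk over an
unsorted list with a lookup through a small index keyed by block id.
Everything here is hypothetical (struct blk, list_find, index_store,
index_load); the toy two-level table is a stand-in for the idea and does
not reproduce the kernel xarray or its API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a memory block; not the kernel's struct. */
struct blk {
	unsigned long id;
	struct blk *next;	/* link in an unsorted list */
};

/* O(n): walk the unsorted list until the id matches. */
static struct blk *list_find(struct blk *head, unsigned long id)
{
	for (struct blk *b = head; b; b = b->next)
		if (b->id == id)
			return b;
	return NULL;
}

/*
 * Toy two-level index with 6 bits per level, covering ids 0..4095.
 * A lookup costs a fixed number of pointer hops regardless of how
 * many blocks are stored.  Leaves are allocated lazily, so a sparse
 * id space stays small.
 */
#define BITS	6
#define FANOUT	(1UL << BITS)

struct leaf { struct blk *slot[FANOUT]; };
struct root { struct leaf *child[FANOUT]; };

/* Returns 0 on success, -1 if the id is out of range or on ENOMEM. */
static int index_store(struct root *r, struct blk *b)
{
	unsigned long hi, lo;

	assert(r && b);
	hi = b->id >> BITS;
	lo = b->id & (FANOUT - 1);
	if (hi >= FANOUT)
		return -1;
	if (!r->child[hi]) {
		r->child[hi] = calloc(1, sizeof(struct leaf));
		if (!r->child[hi])
			return -1;
	}
	r->child[hi]->slot[lo] = b;
	return 0;
}

/* Returns the stored block, or NULL if nothing is stored at id. */
static struct blk *index_load(const struct root *r, unsigned long id)
{
	unsigned long hi = id >> BITS, lo = id & (FANOUT - 1);

	if (hi >= FANOUT || !r->child[hi])
		return NULL;
	return r->child[hi]->slot[lo];
}
```

A caller would store each block once at registration time and use
index_load() wherever it previously walked the list, which mirrors the
shape of the change above without claiming to model its locking or
reference counting.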