On 11/28/23 11:04, Catalin Marinas wrote:
On Mon, Nov 27, 2023 at 02:41:53PM -0500, Waiman Long wrote:
/**
* kmemleak_free_percpu - unregister a previously registered __percpu object
* @ptr: __percpu pointer to beginning of the object
*
* This function is called from the kernel percpu allocator when an object
- * (memory block) is freed (free_percpu).
+ * (memory block) is freed (free_percpu). Since this function is inherently
+ * slow especially on systems with a large number of CPUs, defer the actual
+ * removal of kmemleak objects associated with the percpu pointer to a
+ * workqueue if it is not in a task context.
*/
void __ref kmemleak_free_percpu(const void __percpu *ptr)
{
- unsigned int cpu;
-
pr_debug("%s(0x%px)\n", __func__, ptr);
- if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
- for_each_possible_cpu(cpu)
- delete_object_full((unsigned long)per_cpu_ptr(ptr,
- cpu));
+ if (!kmemleak_free_enabled || !ptr || IS_ERR(ptr))
+ return;
+
+ if (!in_task()) {
+ struct kmemleak_percpu_addr *addr;
+
+ addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
+ if (addr) {
+ INIT_WORK(&addr->work, kmemleak_free_percpu_workfn);
+ addr->ptr = ptr;
+ queue_work(system_long_wq, &addr->work);
+ return;
+ }
We can't defer this freeing. It can mess up the kmemleak metadata if the
per-cpu pointer is re-allocated before kmemleak removed it from its
object tree.
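To make the re-allocation hazard concrete, here is a minimal user-space sketch, not kernel code. Every toy_* name is hypothetical, a flat table stands in for the kmemleak object tree, and the model assumes that registering an address which is already present is rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TRACKED 8

/* Toy stand-in for the kmemleak object tree: a flat table of addresses. */
struct toy_tracker {
	unsigned long addr[TOY_MAX_TRACKED];
	bool used[TOY_MAX_TRACKED];
};

static bool toy_is_tracked(const struct toy_tracker *t, unsigned long addr)
{
	for (size_t i = 0; i < TOY_MAX_TRACKED; i++)
		if (t->used[i] && t->addr[i] == addr)
			return true;
	return false;
}

/* Register an address; an address that is already present is rejected. */
static bool toy_track(struct toy_tracker *t, unsigned long addr)
{
	if (toy_is_tracked(t, addr))
		return false;
	for (size_t i = 0; i < TOY_MAX_TRACKED; i++) {
		if (!t->used[i]) {
			t->addr[i] = addr;
			t->used[i] = true;
			return true;
		}
	}
	return false; /* table full */
}

/* Remove the entry for an address, if there is one. */
static void toy_untrack(struct toy_tracker *t, unsigned long addr)
{
	for (size_t i = 0; i < TOY_MAX_TRACKED; i++) {
		if (t->used[i] && t->addr[i] == addr) {
			t->used[i] = false;
			return;
		}
	}
}

struct toy_race_result {
	bool second_insert_ok;    /* did the re-allocation get registered? */
	bool live_object_tracked; /* is the live object tracked at the end? */
};

static struct toy_race_result toy_deferred_free_race(void)
{
	struct toy_tracker t = { 0 };
	const unsigned long addr = 0x1000;
	struct toy_race_result r;
	bool ok;

	ok = toy_track(&t, addr); /* first allocation is registered */
	assert(ok);

	/* The object is freed here, but its untrack is queued for later. */

	/* The allocator hands out the same address before the untrack runs. */
	r.second_insert_ok = toy_track(&t, addr);

	/* The queued untrack finally runs, keyed only by the address. */
	toy_untrack(&t, addr);

	r.live_object_tracked = toy_is_tracked(&t, addr);
	return r;
}
```

In this model the second registration fails because the stale entry is still present, and the late removal then leaves the live allocation with no metadata at all.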
You are right. In fact, it is possible for kmemleak_free_percpu() to be
called from softIRQ context. And if the system has hundreds of CPUs, it
will take a long time to process all the free requests.
The problem is looking up the object tree for each per-cpu offset. We
can make the percpu pointer handling O(1) since freeing is only done by
the main __percpu pointer, so that's the only one needing a look-up. So
far the per-cpu pointers are not tracked for leaking, only scanned.
We could just add the per_cpu_ptr(ptr, 0) to the kmemleak
object_tree_root but when scanning we don't have an inverse function to
get the __percpu pointer back and calculate the pointers for the other
CPUs (well, we could with some hacks but they are probably fragile).
We could keep a separate tree to track the percpu area. We will know the
max percpu offset in each percpu area. The base of the percpu area is
just per_cpu_ptr(0, cpu).
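The inverse mapping this suggestion relies on can be sketched in user space as follows. This is only an illustration under a strong assumption: each CPU has a single contiguous area with a known base and maximum offset, which may not match the real percpu allocator layout. The toy_* names and TOY_NR_CPUS are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS 4

/* Assumed layout: one contiguous percpu area per CPU. */
struct toy_percpu_layout {
	unsigned long base[TOY_NR_CPUS]; /* models per_cpu_ptr(0, cpu) */
	unsigned long max_offset;        /* largest valid offset in an area */
};

/* Forward mapping: percpu offset plus CPU number gives an address. */
static unsigned long toy_per_cpu_addr(const struct toy_percpu_layout *l,
				      unsigned long offset, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	return l->base[cpu] + offset;
}

/*
 * Inverse mapping: find the CPU area that contains addr and recover the
 * percpu offset. Returns false if addr lies in no area.
 */
static bool toy_addr_to_offset(const struct toy_percpu_layout *l,
			       unsigned long addr,
			       unsigned long *offset, int *cpu)
{
	for (int i = 0; i < TOY_NR_CPUS; i++) {
		if (addr >= l->base[i] &&
		    addr - l->base[i] <= l->max_offset) {
			*offset = addr - l->base[i];
			*cpu = i;
			return true;
		}
	}
	return false;
}
```

With the offset recovered, the addresses for the other CPUs follow from toy_per_cpu_addr() without any further search.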
What I came up with is a separate object_percpu_tree_root similar to the
object_phys_tree_root. The only reason for these additional trees is to
look up the kmemleak metadata when needed (usually freeing). They don't
contain objects that are tracked for actual leaking, only scanned. A
briefly tested patch below. I need to go through it again, update some
comments and write a commit log:
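As a rough illustration of the separate-tree idea (not the patch itself), the user-space sketch below keeps three independent lookup structures and picks one from an object flag, so a free keyed by the main __percpu pointer costs a single look-up. The toy_* names and TOY_OBJECT_* flags are hypothetical, and an unsorted array stands in for each tree.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_OBJECT_PHYS   (1U << 0)
#define TOY_OBJECT_PERCPU (1U << 1)
#define TOY_MAX_OBJECTS   16

/* An unsorted array stands in for a search tree in this model. */
struct toy_tree {
	unsigned long key[TOY_MAX_OBJECTS];
	size_t count;
	unsigned long lookups; /* number of look-ups performed */
};

static struct toy_tree toy_virt_tree, toy_phys_tree, toy_percpu_tree;

/* Select the tree that holds objects carrying the given flags. */
static struct toy_tree *toy_object_tree(unsigned int flags)
{
	if (flags & TOY_OBJECT_PHYS)
		return &toy_phys_tree;
	if (flags & TOY_OBJECT_PERCPU)
		return &toy_percpu_tree;
	return &toy_virt_tree;
}

/* Register a key in the tree selected by flags. */
static void toy_insert(unsigned long key, unsigned int flags)
{
	struct toy_tree *t = toy_object_tree(flags);

	assert(t->count < TOY_MAX_OBJECTS);
	t->key[t->count++] = key;
}

/* One look-up in the selected tree; returns 1 if the key was removed. */
static int toy_delete(unsigned long key, unsigned int flags)
{
	struct toy_tree *t = toy_object_tree(flags);

	t->lookups++;
	for (size_t i = 0; i < t->count; i++) {
		if (t->key[i] == key) {
			/* Fill the hole with the last entry. */
			t->key[i] = t->key[--t->count];
			return 1;
		}
	}
	return 0;
}
```

Because the trees are independent, a percpu key and a virtual address with the same numeric value do not collide, and the number of look-ups per free no longer depends on the CPU count.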
That sounds like a good idea, similar to what I suggested above. I will
do a more careful review of the change tomorrow as it is getting late
for me today.
Cheers,
Longman