Re: [PATCH] mm/kmemleak: Add cond_resched() to kmemleak_free_percpu()

On 11/28/23 11:04, Catalin Marinas wrote:
On Mon, Nov 27, 2023 at 02:41:53PM -0500, Waiman Long wrote:
  /**
   * kmemleak_free_percpu - unregister a previously registered __percpu object
   * @ptr:	__percpu pointer to beginning of the object
   *
   * This function is called from the kernel percpu allocator when an object
- * (memory block) is freed (free_percpu).
+ * (memory block) is freed (free_percpu). Since this function is inherently
+ * slow especially on systems with a large number of CPUs, defer the actual
+ * removal of kmemleak objects associated with the percpu pointer to a
+ * workqueue if it is not in a task context.
   */
  void __ref kmemleak_free_percpu(const void __percpu *ptr)
  {
-	unsigned int cpu;
-
  	pr_debug("%s(0x%px)\n", __func__, ptr);
-	if (kmemleak_free_enabled && ptr && !IS_ERR(ptr))
-		for_each_possible_cpu(cpu)
-			delete_object_full((unsigned long)per_cpu_ptr(ptr,
-								      cpu));
+	if (!kmemleak_free_enabled || !ptr || IS_ERR(ptr))
+		return;
+
+	if (!in_task()) {
+		struct kmemleak_percpu_addr *addr;
+
+		addr = kzalloc(sizeof(*addr), GFP_ATOMIC);
+		if (addr) {
+			INIT_WORK(&addr->work, kmemleak_free_percpu_workfn);
+			addr->ptr = ptr;
+			queue_work(system_long_wq, &addr->work);
+			return;
+		}
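
[Editorial note: the hunk above queues a struct kmemleak_percpu_addr onto system_long_wq, but the work item itself is not part of the quoted context. A minimal sketch of what the deferred helper would plausibly look like inside mm/kmemleak.c is shown below; the struct layout and the body of kmemleak_free_percpu_workfn are reconstructions, not code from the posted patch.]

struct kmemleak_percpu_addr {
	struct work_struct	work;
	const void __percpu	*ptr;
};

static void kmemleak_free_percpu_workfn(struct work_struct *work)
{
	struct kmemleak_percpu_addr *addr;
	unsigned int cpu;

	addr = container_of(work, struct kmemleak_percpu_addr, work);

	/* Replay the per-CPU delete loop, now in process context. */
	for_each_possible_cpu(cpu)
		delete_object_full((unsigned long)per_cpu_ptr(addr->ptr, cpu));

	kfree(addr);
}
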
We can't defer this freeing. It can mess up the kmemleak metadata if the
per-cpu pointer is re-allocated before kmemleak removes it from its
object tree.
You are right. In fact, it is possible for kmemleak_free_percpu() to be called from softIRQ context. And if the system has hundreds of CPUs, it will take a long time to process all the free requests.

The problem is looking up the object tree for each per-cpu offset. We
can make the percpu pointer handling O(1) since freeing is only done by
the main __percpu pointer, so that's the only one needing a look-up. So
far the per-cpu pointers are not tracked for leaking, only scanned.

We could just add the per_cpu_ptr(ptr, 0) to the kmemleak
object_tree_root but when scanning we don't have an inverse function to
get the __percpu pointer back and calculate the pointers for the other
CPUs (well, we could with some hacks but they are probably fragile).
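
[Editorial note: a small sketch of why scanning needs the original __percpu cookie. scan_percpu_object() is a hypothetical helper and scan_block() is kmemleak's existing range scanner; given only per_cpu_ptr(ptr, 0) there is no generic inverse to recover ptr, and hence no way to derive the other CPUs' copies.]

static void scan_percpu_object(void __percpu *ptr, size_t size)
{
	unsigned int cpu;

	/* The __percpu cookie is needed to derive every CPU's copy. */
	for_each_possible_cpu(cpu) {
		void *start = per_cpu_ptr(ptr, cpu);

		scan_block(start, start + size, NULL);
	}
}
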
We could keep a separate tree to track the percpu areas. We will know the max percpu offset in each percpu area, and the base of each percpu area is just per_cpu_ptr(0, cpu).

What I came up with is a separate object_percpu_tree_root similar to the
object_phys_tree_root. The only reason for these additional trees is to
look up the kmemleak metadata when needed (usually freeing). They don't
contain objects that are tracked for actual leaking, only scanned. A
briefly tested patch below. I need to go through it again, update some
comments and write a commit log:
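
[Editorial note: Catalin's briefly tested patch is trimmed from the quote above. Purely as an illustration of the idea, the sketch below keeps percpu objects in a separate look-up tree keyed by the main __percpu pointer, so the free path performs a single look-up instead of one delete_object_full() per possible CPU. The names struct percpu_leak_object, lookup_object_percpu() and delete_object_percpu(), and the private lock, are assumptions rather than the code posted in the thread.]

#include <linux/rbtree.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct percpu_leak_object {
	struct rb_node	rb_node;
	unsigned long	pointer;	/* the main __percpu cookie */
	size_t		size;
};

static struct rb_root object_percpu_tree_root = RB_ROOT;
static DEFINE_RAW_SPINLOCK(percpu_tree_lock);

static struct percpu_leak_object *lookup_object_percpu(unsigned long ptr)
{
	struct rb_node *rb = object_percpu_tree_root.rb_node;

	while (rb) {
		struct percpu_leak_object *obj =
			rb_entry(rb, struct percpu_leak_object, rb_node);

		if (ptr < obj->pointer)
			rb = rb->rb_left;
		else if (ptr > obj->pointer)
			rb = rb->rb_right;
		else
			return obj;
	}
	return NULL;
}

/* Free path: one look-up keyed by the __percpu pointer, no per-CPU loop. */
static void delete_object_percpu(unsigned long ptr)
{
	struct percpu_leak_object *obj;
	unsigned long flags;

	raw_spin_lock_irqsave(&percpu_tree_lock, flags);
	obj = lookup_object_percpu(ptr);
	if (obj)
		rb_erase(&obj->rb_node, &object_percpu_tree_root);
	raw_spin_unlock_irqrestore(&percpu_tree_lock, flags);

	kfree(obj);
}

Insertion of such objects would happen in kmemleak_alloc_percpu(), and scanning would still visit per_cpu_ptr(obj->pointer, cpu) for every possible CPU using the stored base and size, which matches the point above that these objects are only scanned, not tracked as potential leaks themselves.
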

That sounds like a good idea, along the lines of what I said above. I will do a more careful review of the change tomorrow as it is getting late for me today.

Cheers,
Longman