Re: [PATCH v2] mm/memory-failure: Use raw_spinlock_t in struct memory_failure_cpu

Waiman Long <longman@xxxxxxxxxx> · Wed, 7 Aug 2024 13:41:50 -0400

On 8/6/24 23:15, Miaohe Lin wrote:
On 2024/8/7 0:41, Waiman Long wrote:
The memory_failure_cpu structure is a per-cpu structure. Access to its
content requires the use of get_cpu_var() to lock in the current CPU
and disable preemption. The use of a regular spinlock_t for locking
purpose is fine for a non-RT kernel.

Since the integration of RT spinlock support into the v5.15 kernel,
a spinlock_t in a RT kernel becomes a sleeping lock and taking a
sleeping lock in a preemption disabled context is illegal resulting in
the following kind of warning.

   [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
   [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0
   [12135.732252] preempt_count: 1, expected: 0
   [12135.732255] RCU nest depth: 2, expected: 2
     :
   [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021
   [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred
   [12135.732433] Call Trace:
   [12135.732436]  <TASK>
   [12135.732450]  dump_stack_lvl+0x57/0x81
   [12135.732461]  __might_resched.cold+0xf4/0x12f
   [12135.732479]  rt_spin_lock+0x4c/0x100
   [12135.732491]  memory_failure_queue+0x40/0xe0
   [12135.732503]  ghes_do_memory_failure+0x53/0x390
   [12135.732516]  ghes_do_proc.constprop.0+0x229/0x3e0
   [12135.732575]  ghes_proc+0xf9/0x1a0
   [12135.732591]  ghes_notify_hed+0x6a/0x150
   [12135.732602]  notifier_call_chain+0x43/0xb0
   [12135.732626]  blocking_notifier_call_chain+0x43/0x60
   [12135.732637]  acpi_ev_notify_dispatch+0x47/0x70
   [12135.732648]  acpi_os_execute_deferred+0x13/0x20
   [12135.732654]  process_one_work+0x41f/0x500
   [12135.732695]  worker_thread+0x192/0x360
   [12135.732715]  kthread+0x111/0x140
   [12135.732733]  ret_from_fork+0x29/0x50
   [12135.732779]  </TASK>

Fix it by using a raw_spinlock_t for locking instead. Also move the
pr_err() out of the lock critical section to avoid indeterminate latency
of this call.

Fixes: ea8f5fb8a71f ("HWPoison: add memory_failure_queue()")
We shouldn't have this problem before RT spinlock is supported? If so, this Fixes tag might be wrong.
OK, I can take out the Fixes tag. It is hard to pinpoint a particular RT 
related commit.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
  mm/memory-failure.c | 18 ++++++++++--------
  1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 581d3e5c9117..7aeb5198c2a0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2417,7 +2417,7 @@ struct memory_failure_entry {
  struct memory_failure_cpu {
  	DECLARE_KFIFO(fifo, struct memory_failure_entry,
  		      MEMORY_FAILURE_FIFO_SIZE);
-	spinlock_t lock;
+	raw_spinlock_t lock;
  	struct work_struct work;
  };
  
@@ -2443,19 +2443,21 @@ void memory_failure_queue(unsigned long pfn, int flags)
  {
  	struct memory_failure_cpu *mf_cpu;
  	unsigned long proc_flags;
+	bool buffer_overflow;
  	struct memory_failure_entry entry = {
  		.pfn =		pfn,
  		.flags =	flags,
  	};
  
  	mf_cpu = &get_cpu_var(memory_failure_cpu);
-	spin_lock_irqsave(&mf_cpu->lock, proc_flags);
-	if (kfifo_put(&mf_cpu->fifo, entry))
+	raw_spin_lock_irqsave(&mf_cpu->lock, proc_flags);
+	buffer_overflow = !kfifo_put(&mf_cpu->fifo, entry);
+	if (!buffer_overflow)
  		schedule_work_on(smp_processor_id(), &mf_cpu->work);
-	else
+	raw_spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
+	if (buffer_overflow)
  		pr_err("buffer overflow when queuing memory failure at %#lx\n",
  		       pfn);
Should we put pr_err() further under put_cpu_var()?

Yes, we should probably enable preemption first before calling pr_err(). 
Will make the change in v2.

Thanks,
Longman


-	spin_unlock_irqrestore(&mf_cpu->lock, proc_flags)
  	put_cpu_var(memory_failure_cpu);
  }
Will below diff be more straightforward?

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b68953dc9fad..be172cbc6ca9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2553,20 +2553,23 @@ void memory_failure_queue(unsigned long pfn, int flags)
  {
         struct memory_failure_cpu *mf_cpu;
         unsigned long proc_flags;
+       bool buffer_overflow = false;
         struct memory_failure_entry entry = {
                 .pfn =          pfn,
                 .flags =        flags,
         };

         mf_cpu = &get_cpu_var(memory_failure_cpu);
-       spin_lock_irqsave(&mf_cpu->lock, proc_flags);
+       raw_spin_lock_irqsave(&mf_cpu->lock, proc_flags);
         if (kfifo_put(&mf_cpu->fifo, entry))
                 schedule_work_on(smp_processor_id(), &mf_cpu->work);
         else
+               buffer_overflow = true;
+       raw_spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
+       put_cpu_var(memory_failure_cpu);
+       if (buffer_overflow)
                 pr_err("buffer overflow when queuing memory failure at %#lx\n",
                        pfn);
-       spin_unlock_irqrestore(&mf_cpu->lock, proc_flags);
-       put_cpu_var(memory_failure_cpu);
  }
  EXPORT_SYMBOL_GPL(memory_failure_queue);

But no strong opinion.

Thanks.
.