Patch "smp,csd: Throw an error if a CSD lock is stuck for too long" has been added to the 6.5-stable tree

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



This is a note to let you know that I've just added the patch titled

    smp,csd: Throw an error if a CSD lock is stuck for too long

to the 6.5-stable tree which can be found at:
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
     smp-csd-throw-an-error-if-a-csd-lock-is-stuck-for-to.patch
and it can be found in the queue-6.5 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let <stable@xxxxxxxxxxxxxxx> know about it.



commit 34d97f7deaa3cc7fe0f915b0264c34f8c3825fa7
Author: Rik van Riel <riel@xxxxxxxxxxx>
Date:   Mon Aug 21 16:04:09 2023 -0400

    smp,csd: Throw an error if a CSD lock is stuck for too long
    
    [ Upstream commit 94b3f0b5af2c7af69e3d6e0cdd9b0ea535f22186 ]
    
    The CSD lock seems to get stuck in 2 "modes". When it gets stuck
    temporarily, it usually gets released in a few seconds, and sometimes
    up to one or two minutes.
    
    If the CSD lock stays stuck for more than several minutes, it never
    seems to get unstuck, and gradually more and more things in the system
    end up also getting stuck.
    
    In the latter case, we should just give up, so the system can dump out
    a little more information about what went wrong, and, with panic_on_oops
    and a kdump kernel loaded, dump a whole bunch more information about what
    might have gone wrong.  In addition, there is an smp.panic_on_ipistall
    kernel boot parameter that by default retains the old behavior, but when
    set enables the panic after the CSD lock has been stuck for more than
    the specified number of milliseconds, as in 300,000 for five minutes.
    
    [ paulmck: Apply Imran Khan feedback. ]
    [ paulmck: Apply Leonardo Bras feedback. ]
    
    Link: https://lore.kernel.org/lkml/bc7cc8b0-f587-4451-8bcd-0daae627bcc7@paulmck-laptop/
    Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
    Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxx>
    Reviewed-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
    Reviewed-by: Leonardo Bras <leobras@xxxxxxxxxx>
    Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
    Cc: Valentin Schneider <vschneid@xxxxxxxxxx>
    Cc: Juergen Gross <jgross@xxxxxxxx>
    Cc: Jonathan Corbet <corbet@xxxxxxx>
    Cc: Randy Dunlap <rdunlap@xxxxxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 23ebe34ff901e..b951b4c2808f1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5781,6 +5781,13 @@
 			This feature may be more efficiently disabled
 			using the csdlock_debug- kernel parameter.
 
+	smp.panic_on_ipistall= [KNL]
+			If a csd_lock_timeout extends for more than
+			the specified number of milliseconds, panic the
+			system.  By default, let CSD-lock acquisition
+			take as long as they take.  Specifying 300,000
+			for this value provides a 5-minute timeout.
+
 	smsc-ircc2.nopnp	[HW] Don't use PNP to discover SMC devices
 	smsc-ircc2.ircc_cfg=	[HW] Device configuration I/O port
 	smsc-ircc2.ircc_sir=	[HW] SIR base I/O port
diff --git a/kernel/smp.c b/kernel/smp.c
index 385179dae360e..0a9a3262f7822 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -168,6 +168,8 @@ static DEFINE_PER_CPU(void *, cur_csd_info);
 
 static ulong csd_lock_timeout = 5000;  /* CSD lock timeout in milliseconds. */
 module_param(csd_lock_timeout, ulong, 0444);
+static int panic_on_ipistall;  /* CSD panic timeout in milliseconds, 300000 for five minutes. */
+module_param(panic_on_ipistall, int, 0444);
 
 static atomic_t csd_bug_count = ATOMIC_INIT(0);
 
@@ -228,6 +230,7 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	}
 
 	ts2 = sched_clock();
+	/* How long since we last checked for a stuck CSD lock.*/
 	ts_delta = ts2 - *ts1;
 	if (likely(ts_delta <= csd_lock_timeout_ns || csd_lock_timeout_ns == 0))
 		return false;
@@ -241,9 +244,17 @@ static bool csd_lock_wait_toolong(struct __call_single_data *csd, u64 ts0, u64 *
 	else
 		cpux = cpu;
 	cpu_cur_csd = smp_load_acquire(&per_cpu(cur_csd, cpux)); /* Before func and info. */
+	/* How long since this CSD lock was stuck. */
+	ts_delta = ts2 - ts0;
 	pr_alert("csd: %s non-responsive CSD lock (#%d) on CPU#%d, waiting %llu ns for CPU#%02d %pS(%ps).\n",
-		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts2 - ts0,
+		 firsttime ? "Detected" : "Continued", *bug_id, raw_smp_processor_id(), ts_delta,
 		 cpu, csd->func, csd->info);
+	/*
+	 * If the CSD lock is still stuck after 5 minutes, it is unlikely
+	 * to become unstuck. Use a signed comparison to avoid triggering
+	 * on underflows when the TSC is out of sync between sockets.
+	 */
+	BUG_ON(panic_on_ipistall > 0 && (s64)ts_delta > ((s64)panic_on_ipistall * NSEC_PER_MSEC));
 	if (cpu_cur_csd && csd != cpu_cur_csd) {
 		pr_alert("\tcsd: CSD lock (#%d) handling prior %pS(%ps) request.\n",
 			 *bug_id, READ_ONCE(per_cpu(cur_csd_func, cpux)),



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Linux USB Devel]     [Linux Audio Users]     [Yosemite News]     [Linux Kernel]     [Linux SCSI]

  Powered by Linux