Hi Bart,
On 1/16/23 18:55, Bart Van Assche wrote:
On 1/16/23 06:59, Steffen Maier wrote:
For the past few days/weeks, we have occasionally seen the ALUA- and
sleep-related kernel BUG and WARNING below (with panic_on_warn) in our CI.
It reminds me of
[PATCH 0/2] Rework how the ALUA driver calls scsi_device_put()
https://lore.kernel.org/linux-scsi/166986602290.2101055.17397734326843853911.b4-ty@xxxxxxxxxx/
which I thought was the fix and went into 6.2-rc(1?) on 2022-12-14 with
[GIT PULL] first round of SCSI updates for the 6.1+ merge window
https://lore.kernel.org/linux-scsi/b2e824bbd1e40da64d2d01657f2f7a67b98919fb.camel@xxxxxxxxxxxxxxxxxxxxx/T/#u
Due to limited history, I cannot tell exactly when the problems started or
whether they really correlate with the above.
The test workload consists of all kinds of coverage tests for zfcp recovery,
including SCSI device removal and/or rescan.
[ 4569.045992] BUG: sleeping function called from invalid context at
drivers/scsi/device_handler/scsi_dh_alua.c:992
Thanks for your report and also for having included this call trace. Is my
understanding correct that alua_rtpg_queue+0x3c refers to the might_sleep()
near the start of alua_rtpg_queue()? If so, please help with testing the
following patch:
diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 49cc18a87473..79afa7acdfbc 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -989,8 +989,6 @@ static bool alua_rtpg_queue(struct alua_port_group
 	int start_queue = 0;
 	unsigned long flags;
-	might_sleep();
-
 	if (WARN_ON_ONCE(!pg) || scsi_device_get(sdev))
 		return false;
I'm proposing this change because the context from which a request is queued
should hold a reference on 'sdev' while a request is in progress so
alua_check_sense() should not trigger the scsi_device_put() call in
alua_rtpg_queue().
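To make the reference-counting argument above concrete, here is a minimal userspace sketch. It is only a model under stated assumptions: the names model_dev, model_get(), model_put() and model_queue() are hypothetical and are not the kernel's scsi_device API, and the "meanwhile" flag merely simulates another context releasing its reference. It shows that a temporary get/put pair inside a queue function cannot drop the last reference as long as the caller keeps its own reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; this is NOT struct scsi_device. */
struct model_dev {
	int refcount;        /* references currently held */
	bool final_put_seen; /* set once a put dropped the last reference */
};

/* Take a reference unless the device is already gone. */
static bool model_get(struct model_dev *dev)
{
	if (dev->refcount <= 0)
		return false;
	dev->refcount++;
	return true;
}

/* Drop a reference; the final put stands in for the release path. */
static void model_put(struct model_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		dev->final_put_seen = true;
}

/*
 * Models a queue function that takes a temporary reference and drops it
 * again. If owner_drops_meanwhile is set, the caller's reference is
 * released in between (assumption: this simulates the problematic case).
 * Returns true if the last reference was dropped inside this function.
 */
static bool model_queue(struct model_dev *dev, bool owner_drops_meanwhile)
{
	if (!model_get(dev))
		return false;
	if (owner_drops_meanwhile)
		model_put(dev);
	model_put(dev);
	return dev->final_put_seen;
}
```

With a starting refcount of 1 held by the caller, model_queue(&dev, false) leaves the count at 1 and never reaches the final put; only when the caller's reference goes away in the meantime does the put inside the function become the last one.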
How would removing this check solve the other, seemingly more fatal (even
without panic_on_warn), WARNING?
[ 4760.878107] do not call blocking ops when !TASK_RUNNING; state=2 set at
[<000000017ed2c0fa>] __wait_for_common+0xa2/0x240
FWIW, we only seem to get such reports for debug kernel builds (I am not sure
which Kconfig options are relevant), not for production / performance
builds.
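For reference, my current understanding (an assumption, not something I have verified against our builds) is that both messages come from debug-only sleep checking: one fires when a possibly-sleeping function is called in atomic context, the other when the task state has already been changed away from TASK_RUNNING (state=2 would be TASK_UNINTERRUPTIBLE). As far as I know, these checks depend on CONFIG_DEBUG_ATOMIC_SLEEP. The following userspace sketch only models the two conditions; all model_* and MODEL_* names are hypothetical and none of this is kernel code.

```c
#include <stdbool.h>

/* Task states as modelled here; the values mirror the kernel's 0 and 2. */
#define MODEL_TASK_RUNNING         0
#define MODEL_TASK_UNINTERRUPTIBLE 2

/* Report bits returned by the modelled check (hypothetical names). */
#define MODEL_WARN_NOT_RUNNING 0x1 /* "do not call blocking ops when !TASK_RUNNING" */
#define MODEL_BUG_ATOMIC       0x2 /* "sleeping function called from invalid context" */

struct model_task {
	int state;        /* MODEL_TASK_* value */
	int atomic_depth; /* > 0 while in a context that must not sleep */
};

/*
 * Models a debug-only "might sleep" annotation: it does not sleep itself,
 * it only reports which of the two conditions would trigger a message.
 */
static int model_might_sleep(const struct model_task *t)
{
	int reports = 0;

	if (t->state != MODEL_TASK_RUNNING)
		reports |= MODEL_WARN_NOT_RUNNING;
	if (t->atomic_depth > 0)
		reports |= MODEL_BUG_ATOMIC;
	return reports;
}
```

In this model the two reports are independent conditions, which is why I doubt that addressing one of them automatically explains the other.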
--
Mit freundlichen Gruessen / Kind regards
Steffen Maier
Linux on IBM Z and LinuxONE