Re: kernel BUG scsi_dh_alua sleeping from invalid context && kernel WARNING do not call blocking ops when !TASK_RUNNING

Hi Bart,

On 1/16/23 18:55, Bart Van Assche wrote:
On 1/16/23 06:59, Steffen Maier wrote:
For the past few days/weeks, we have sometimes seen the ALUA- and sleep-related kernel BUG and WARNING below (with panic_on_warn) in our CI.

It reminds me of
[PATCH 0/2] Rework how the ALUA driver calls scsi_device_put()
https://lore.kernel.org/linux-scsi/166986602290.2101055.17397734326843853911.b4-ty@xxxxxxxxxx/

which I thought was the fix and went into 6.2-rc(1?) on 2022-12-14 with
[GIT PULL] first round of SCSI updates for the 6.1+ merge window
https://lore.kernel.org/linux-scsi/b2e824bbd1e40da64d2d01657f2f7a67b98919fb.camel@xxxxxxxxxxxxxxxxxxxxx/T/#u

Due to limited history, I cannot tell exactly when the problems started or whether they really correlate with the above.

The test workload consists of all kinds of coverage tests for zfcp recovery, including SCSI device removal and/or rescan.

[ 4569.045992] BUG: sleeping function called from invalid context at drivers/scsi/device_handler/scsi_dh_alua.c:992

Thanks for your report and also for having included this call trace. Is my understanding correct that alua_rtpg_queue+0x3c refers to the might_sleep() near the start of alua_rtpg_queue()? If so, please help with testing the following patch:

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 49cc18a87473..79afa7acdfbc 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -989,8 +989,6 @@ static bool alua_rtpg_queue(struct alua_port_group
      int start_queue = 0;
      unsigned long flags;

-    might_sleep();
-
      if (WARN_ON_ONCE(!pg) || scsi_device_get(sdev))
          return false;


I'm proposing this change because the context from which a request is queued should hold a reference on 'sdev' while a request is in progress, so alua_check_sense() should not trigger the scsi_device_put() call in alua_rtpg_queue().
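
One reading of the reference-counting argument above can be sketched in userspace C. This is only a toy model under stated assumptions: struct toy_dev, toy_get(), toy_put() and toy_queue() are hypothetical names for illustration and are not the kernel's struct scsi_device, scsi_device_get(), scsi_device_put() or alua_rtpg_queue(). It shows that when the caller already holds a reference, a get/put pair inside the callee cannot drop the last reference, so the release step (the part that may sleep) does not run there.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a refcounted device (hypothetical, not struct scsi_device). */
struct toy_dev {
    int refs;      /* current reference count */
    int releases;  /* how often the release step ran */
};

/* Take a reference; fails if the object is already gone. */
static bool toy_get(struct toy_dev *d)
{
    if (d->refs == 0)
        return false;
    d->refs++;
    return true;
}

/* Drop a reference; the release step runs only on the last put. */
static void toy_put(struct toy_dev *d)
{
    assert(d->refs > 0);
    if (--d->refs == 0)
        d->releases++;  /* in this model, the step that might sleep */
}

/*
 * Models a queueing helper that takes and drops its own reference.
 * Returns the number of releases triggered inside the helper,
 * or -1 if no reference could be taken.
 */
static int toy_queue(struct toy_dev *d)
{
    int before = d->releases;

    if (!toy_get(d))
        return -1;
    toy_put(d);
    return d->releases - before;
}
```

With a caller-held reference (refs == 1 on entry), toy_queue() returns 0: the count goes 1, 2, 1 and the release step never runs inside the helper. Whether the real call paths always satisfy that precondition is exactly what the test of the patch would need to show.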

How would removing this check solve the other, seemingly more fatal (even without panic_on_warn) WARNING?

[ 4760.878107] do not call blocking ops when !TASK_RUNNING; state=2 set at [<000000017ed2c0fa>] __wait_for_common+0xa2/0x240
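
For readers unfamiliar with this second message, the following userspace C sketch models the kind of check that produces it. It is a toy under stated assumptions: struct toy_task, toy_set_state() and toy_might_sleep() are hypothetical names, and only the numeric state values are meant to mirror the kernel's numbering (TASK_RUNNING is 0, TASK_UNINTERRUPTIBLE is 2, which matches the "state=2" in the log line). The sketch shows that the warning depends on the task state set earlier by the waiter, not on whether a might_sleep() annotation exists in alua_rtpg_queue().

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values chosen to mirror the kernel's task state numbering. */
enum toy_state {
    TOY_RUNNING = 0,
    TOY_INTERRUPTIBLE = 1,
    TOY_UNINTERRUPTIBLE = 2
};

/* Hypothetical stand-in for the current task. */
struct toy_task {
    enum toy_state state;
    int warnings;  /* how often the check complained */
};

/* Models a waiter marking itself as about to sleep. */
static void toy_set_state(struct toy_task *t, enum toy_state s)
{
    t->state = s;
}

/*
 * Models a debug check at the entry of a blocking operation:
 * returns true if the task is in the running state, otherwise
 * counts a warning and returns false.
 */
static bool toy_might_sleep(struct toy_task *t)
{
    if (t->state != TOY_RUNNING) {
        t->warnings++;
        return false;
    }
    return true;
}
```

In this model, a task that has set TOY_UNINTERRUPTIBLE (as a wait-for-completion style waiter would) and then enters any blocking operation trips the check, regardless of which function the blocking operation lives in.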


FWIW, we seem to get such reports only for debug kernel builds (not sure which kconfig options are relevant), not for production / performance builds.

--
Mit freundlichen Gruessen / Kind regards
Steffen Maier

Linux on IBM Z and LinuxONE

https://www.ibm.com/privacy/us/en/
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Gregor Pillen
Geschaeftsfuehrung: David Faller
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



