Re: [Bug 214147] New: ISCSI broken in last release

michael.christie@xxxxxxxxxx · Wed, 1 Sep 2021 18:48:28 -0500

On 8/23/21 6:08 AM, bugzilla-daemon@xxxxxxxxxxxxxxxxxxx wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=214147
> 
>             Bug ID: 214147
>            Summary: ISCSI broken in last release
>            Product: IO/Storage
>            Version: 2.5
>     Kernel Version: 5.13.12
>           Hardware: All
>                 OS: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: SCSI
>           Assignee: linux-scsi@xxxxxxxxxxxxxxx
>           Reporter: slavon.net@xxxxxxxxx
>         Regression: Yes
> 
> Created attachment 298441
>   --> https://bugzilla.kernel.org/attachment.cgi?id=298441&action=edit
> dmesg log
> 
> After some time iscsi go to broke and help only reboot
> 
What are you doing when you hit the issue?

What does your target setup look like? What are you using for the
backing store?

Are you able to build your own kernels?

The only major changes between 5.12 and 5.13 are some target patches
to batch cmds. However, it looks like you start to hit a problem
earlier than when that code comes into play. We first see you hit
a data out timeout, so we don't even have all the data for the
cmd, so the target changes in 5.13 don't come into play yet.

[10931.107057] Unable to recover from DataOut timeout while in ERL=0, closing iSCSI connection for I_T Nexus iqn.1991-05.com.microsoft:vhost11.dev.obs.group,i,0x400001370002,iqn.2003-01.org.linux-iscsi.vm2.x8664:sn.b07943625401,t,0x01

However, we do see that some cmds have made it to the core target layer
because we can see the target layer is waiting on cmds to complete
as part of the lun reset handling:

[19906.593285] INFO: task kworker/4:1:3770999 blocked for more than 122 seconds.
[19906.603670]       Tainted: P           O      5.13.12-1.el8.elrepo.x86_64 #1
[19906.613975] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[19906.624208] task:kworker/4:1     state:D stack:    0 pid:3770999 ppid:     2 flags:0x00004000
[19906.624212] Workqueue: events target_tmr_work [target_core_mod]
[19906.624247] Call Trace:
[19906.624249]  __schedule+0x396/0x8a0
[19906.624252]  schedule+0x3c/0xa0
[19906.624255]  schedule_timeout+0x215/0x2b0
[19906.624258]  ? kasprintf+0x4e/0x70
[19906.624261]  wait_for_completion+0x9e/0x100
[19906.624264]  target_put_cmd_and_wait+0x55/0x80 [target_core_mod]
[19906.624279]  core_tmr_lun_reset+0x38b/0x660 [target_core_mod]
[19906.624294]  target_tmr_work+0xb4/0x110 [target_core_mod]
[19906.624309]  process_one_work+0x230/0x3d0
[19906.624312]  worker_thread+0x2d/0x3e0
[19906.624314]  ? process_one_work+0x3d0/0x3d0
[19906.624316]  kthread+0x118/0x140
[19906.624318]  ? set_kthread_struct+0x40/0x40
[19906.624320]  ret_from_fork+0x1f/0x30

and we can see the iscsi layer is not able to relogin because of
outstanding cmds/tmfs.

I can send you a patch that reverts the core target patches. If we can
rule them out then it would help narrow things down.

Or, because it sounds like this is easy to reproduce, we can turn on some
extra lio debugging.
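
One possible way to get extra debugging, sketched below in Python, is to
enable the pr_debug output of the target modules through the kernel's
dynamic debug control file. This is only an assumption about what the
extra debugging could look like: it needs a kernel built with
CONFIG_DYNAMIC_DEBUG, debugfs mounted at /sys/kernel/debug, and root. The
module list and the helper names dyndbg_queries and apply_queries are
illustrative.

```python
from pathlib import Path

# Standard dynamic debug control file (assumes debugfs at /sys/kernel/debug).
CONTROL = Path("/sys/kernel/debug/dynamic_debug/control")

# Assumed module set: LIO core plus the iSCSI fabric module.
LIO_MODULES = ("target_core_mod", "iscsi_target_mod")


def dyndbg_queries(modules=LIO_MODULES, enable=True):
    """Build one dynamic debug query per module; +p turns pr_debug on, -p off."""
    flag = "+p" if enable else "-p"
    return [f"module {name} {flag}" for name in modules]


def apply_queries(queries, control=CONTROL):
    """Write each query to the control file, one write per query (needs root)."""
    for query in queries:
        control.write_text(query + "\n")


if __name__ == "__main__":
    apply_queries(dyndbg_queries())
```

The output can be very verbose, so it is probably best to enable it right
before reproducing the problem and to disable it again afterwards with
dyndbg_queries(enable=False).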