RE: [PATCH 13/16] fnic: allocate device reset command on the fly

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wednesday, November 15, 2023 4:22 PM, Karan Tilak Kumar (kartilak) wrote:
>
> On Tuesday, November 14, 2023 12:02 AM, Hannes Reinecke <hare@xxxxxxx> wrote:
> > > I integrated your patch set using "b4 shazam" and built all the changes to do some dev testing.
> > > I instrumented the code to do the following:
> > >
> > > - After one million IOs, drop one IO response.
> > > - This will trigger an abort. Drop that abort response.
> > > - This will trigger a device reset. I'm seeing that the tag here is 0xFFFFFFFF (SCSI_NO_TAG).
> > >
> > > This tag fails the out-of-range tag check and escalates to host reset.
> > > I've been able to repro this three times with the same result.
> > >
> > > The expectation with this test case is that the device reset should go through successfully without escalating to host reset.
> > >
> > Thanks for the report.
> > Certainly not what was intended with the patch. Can you try with this patch on top:
>
> I applied this patch on top of the previous one and performed the unit tests that I had mentioned before.
>
> >
> > (The first hunk is actually a bugfix, and will be included with the next submission anyway.) Please enable fnic debug during testing and send me the logs.
> > It _might_ be that this is actually by design, as we will be failing if all tags are busy (ie if the error is EWOULDBLOCK).
> > But if we get EAGAIN it means that the queue is blocked and we'll need to investigate.
>
> There were two attempts at reproduction. One of them resulted in a crash. I have enabled fnic debug and captured the log.
> I've pasted the excerpt here for reference.
>
> Repro 1:
>
> Nov 15 16:08:14 localhost kernel: fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0x91 sc: 00000000bbe9fd78 CDB Opcode: 0x28 Dropping icmnd completion
> Nov 15 16:08:35 localhost multipathd[1746]: burst continued 30025 ms, too long time, stopped
> Nov 15 16:08:36 localhost kernel: sched: RT throttling activated
> Nov 15 16:08:44 localhost kernel: scsi host7: Abort Cmd called FCID 0x5205fa, LUN 0x1 TAG 91 flags 3
> Nov 15 16:08:44 localhost kernel: scsi host7: CBD Opcode: 28 Abort issued time: 30036 msec
> Nov 15 16:08:44 localhost kernel: scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0x91 sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion
> Nov 15 16:09:05 localhost multipathd[1746]: burst continued 30041 ms, too long time, stopped
> Nov 15 16:09:06 localhost kernel: scsi host7: Returning from abort cmd type 2 FAILED
> Nov 15 16:09:06 localhost kernel: scsi host7: Device reset called FCID 0x5205fa, LUN 0x1 sc: 00000000bbe9fd78
> Nov 15 16:09:06 localhost kernel: scsi host7: TAG ffffffff sc: 000000005ee1ff69
> Nov 15 16:09:06 localhost kernel: scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_SUCCESS
> Nov 15 16:09:17 localhost kernel: scsi host7: Device reset timed out
> Nov 15 16:09:17 localhost kernel: scsi host7: Abort and terminate issued on Device reset tag 0xffffffff
> Nov 15 16:09:17 localhost kernel: scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_IO_NOT_FOUND
> Nov 15 16:09:27 localhost kernel: scsi host7: Device reset completed - failed
> Nov 15 16:09:27 localhost kernel: scsi host7: Returning from device reset FAILED
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_reset called
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x52060c
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fb
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fa
> Nov 15 16:09:27 localhost kernel: scsi host7: update_mac 00:25:b5:bb:bb:a0
> Nov 15 16:09:27 localhost kernel: scsi host7: Issued fw reset
> Nov 15 16:09:27 localhost kernel: scsi host7: set port_id 0 fp 0000000000000000
> Nov 15 16:09:27 localhost kernel: scsi host7: Returning from fnic reset SUCCESS
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x52060c
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fb
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_cleanup_io: tag:0x91 : sc:0x00000000bbe9fd78 duration = 72835 DID_TRANSPORT_DISRUPTED
> Nov 15 16:09:27 localhost kernel: scsi host7: Calling done for IO not issued to fw: tag:0x91 sc:0x00000000bbe9fd78
> Nov 15 16:09:27 localhost kernel: scsi host7: reset cmpl success
> Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fa
> Nov 15 16:09:29 localhost kernel: host7: Assigned Port ID 0d0840
> Nov 15 16:09:29 localhost kernel: scsi host7: set port_id d0840 fp 0000000046a33a9d
> Nov 15 16:09:29 localhost kernel: scsi host7: update_mac 0e:fc:00:0d:08:40
> Nov 15 16:09:29 localhost kernel: scsi host7: FLOGI reg issued fcid d0840 map 0 dest 8c:60:4f:95:ea:a4
> Nov 15 16:09:29 localhost kernel: scsi host7: flog reg succeeded
> Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: sdb - tur checker timed out
> Nov 15 16:09:38 localhost multipathd[1746]: checker failed path 8:16 in map 3600a098038303873633f4d415156374d
> Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: remaining active paths: 1
> Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: sde - tur checker timed out
> Nov 15 16:09:38 localhost multipathd[1746]: checker failed path 8:64 in map 3600a098038303873633f4d415156374e
> Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: remaining active paths: 1
> Nov 15 16:09:38 localhost multipathd[1746]: sdb: mark as failed
> Nov 15 16:09:38 localhost multipathd[1746]: sde: mark as failed
> Nov 15 16:09:38 localhost kernel: device-mapper: multipath: 253:3: Failing path 8:16.
> Nov 15 16:09:38 localhost kernel: device-mapper: multipath: 253:4: Failing path 8:64.
> Nov 15 16:09:40 localhost kernel: sd 7:0:2:1: Power-on or device reset occurred
> Nov 15 16:09:40 localhost kernel: sd 7:0:1:1: Power-on or device reset occurred
> Nov 15 16:09:40 localhost kernel: sd 7:0:1:0: Power-on or device reset occurred
> Nov 15 16:09:40 localhost kernel: sd 7:0:2:0: Power-on or device reset occurred
> Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: sdb - tur checker reports path is up
> Nov 15 16:09:43 localhost multipathd[1746]: 8:16: reinstated
> Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: remaining active paths: 2
> Nov 15 16:09:43 localhost kernel: device-mapper: multipath: 253:3: Reinstating path 8:16.
> Nov 15 16:09:43 localhost kernel: device-mapper: multipath: 253:4: Reinstating path 8:64.
> Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: sde - tur checker reports path is up
> Nov 15 16:09:43 localhost multipathd[1746]: 8:64: reinstated
> Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: remaining active paths: 2
>
> =============
> Repro 2:
>
> [  184.517949] fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0xeb sc: 000000000d852f23 CDB Opcode: 0x28 Dropping icmnd completion
> [  214.523460] scsi host7: Abort Cmd called FCID 0x5205fb, LUN 0x1 TAG eb flags 3
> [  214.523474] scsi host7: CBD Opcode: 28 Abort issued time: 30026 msec
> [  214.523548] scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0xeb sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion
> [  236.835184] scsi host7: Returning from abort cmd type 2 FAILED
> [  236.943669] scsi host7: Device reset called FCID 0x5205fb, LUN 0x1 sc: 000000000d852f23
> [  236.943688] scsi host7: TAG ffffffff sc: 00000000ea33dc31
> [  236.956270] scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_SUCCESS
> [  247.074054] scsi host7: Device reset timed out
> [  247.074063] scsi host7: Abort and terminate issued on Device reset tag 0xffffffff
> [  247.074070] scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_IO_NOT_FOUND
> [  257.312999] scsi host7: Device reset completed - failed
> [  257.313008] BUG: kernel NULL pointer dereference, address: 000000000000001c
> [  257.313016] #PF: supervisor read access in kernel mode
> [  257.313021] #PF: error_code(0x0000) - not-present page
> [  257.313028] PGD 80000010c746c067 P4D 80000010c746c067 PUD 10c746d067 PMD 0
> [  257.313041] Oops: 0000 [#1] PREEMPT SMP PTI
> [  257.313049] CPU: 0 PID: 1114 Comm: scsi_eh_7 Kdump: loaded Not tainted 6.4.7 #5
> [  257.313058] Hardware name: Cisco Systems Inc UCSB-B200-M5/UCSB-B200-M5, BIOS B200M5.4.0.2a.0.1102181538 11/02/2018
> [  257.313063] RIP: 0010:dma_direct_unmap_sg+0x58/0x1d0
> [  257.313080] Code: 89 f5 89 d3 4d 89 c5 45 31 ff eb 1e 83 e0 fe 89 45 1c 48 89 ef 41 83 c7 01 e8 b4 b9 44 00 48 89 c5 44 39 fb 0f 84 39 01 00 00 <8b> 45 1c a8 01 75 db 4c 8b 05 f2 1b 68 01 49 8b bc 24 58 02 00 00
> [  257.313087] RSP: 0018:ffff9cfa49157cc0 EFLAGS: 00010246
> [  257.313094] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002
> [  257.313099] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8fab45f470d0
> [  257.313103] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffff7fff
> [  257.313107] R10: ffff9cfa49157b18 R11: ffffffff969e5d28 R12: ffff8fab45f470d0
> [  257.313112] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  257.313116] FS:  0000000000000000(0000) GS:ffff8faaff800000(0000) knlGS:0000000000000000
> [  257.313122] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  257.313128] CR2: 000000000000001c CR3: 00000010c874c002 CR4: 00000000007706f0
> [  257.313133] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  257.313137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [  257.313141] PKRU: 55555554
> [  257.313145] Call Trace:
> [  257.313150]  <TASK>
> [  257.313158]  ? __die+0x24/0x70
> [  257.313171]  ? page_fault_oops+0x82/0x150
> [  257.313181]  ? irq_work_queue+0x32/0x60
> [  257.313190]  ? vprintk_emit+0x100/0x230
> [  257.313200]  ? exc_page_fault+0x69/0x150
> [  257.313211]  ? asm_exc_page_fault+0x26/0x30
> [  257.313228]  ? dma_direct_unmap_sg+0x58/0x1d0
> [  257.313244]  fnic_release_ioreq_buf+0x23/0xc0 [fnic]
> [  257.313274]  fnic_device_reset+0x57c/0x860 [fnic]
> [  257.313297]  ? sched_clock_cpu+0xf/0x190
> [  257.313308]  ? raw_spin_rq_lock_nested+0x1d/0x80
> [  257.313318]  ? newidle_balance+0x26e/0x400
> [  257.313329]  scsi_eh_bus_device_reset+0xee/0x2e0
> [  257.313343]  scsi_eh_ready_devs+0x6e/0x2f0
> [  257.313366]  ? rpm_resume+0x411/0x750
> [  257.313372]  scsi_unjam_host+0x10a/0x200
> [  257.313377]  scsi_error_handler+0x144/0x1d0
> [  257.313380]  ? __pfx_scsi_error_handler+0x10/0x10
> [  257.313383]  kthread+0xe3/0x120
> [  257.313386]  ? __pfx_kthread+0x10/0x10
> [  257.313388]  ret_from_fork+0x29/0x50
> [  257.313394]  </TASK>
> [  257.313395] Modules linked in: dm_round_robin snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore nft_compat nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables rfkill nfnetlink qrtr dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl intel_cstate joydev mei_me intel_uncore ioatdma mei intel_pch_thermal ipmi_ssif pcspkr dca lpc_ich acpi_ipmi ipmi_si acpi_pad acpi_power_meter xfs libcrc32c mgag200 sd_mod t10_pi drm_kms_helper crc64_rocksoft_generic crc64_rocksoft crc64 sg fnic syscopyarea sysfillrect sysimgblt i2c_algo_bit libfcoe drm_shmem_helper crct10dif_pclmul ahci crc32_pclmul crc32c_intel libahci libfc ghash_clmulni_intel drm libata
> enic scsi_transport_fc sha512_ssse3 megaraid_sas wmi dm_mirror
> [  257.313470]  dm_region_hash dm_log dm_mod ipmi_devintf ipmi_msghandler fuse
> [  257.313478] CR2: 000000000000001c
>
>
> Regards,
> Karan
>

Repro 3: 

Nov 17 15:27:00 rhel-c4s5 kernel: fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0x52 sc: 00000000fab117f4 CDB Opcode: 0x28 Dropping icmnd completion
Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: Abort Cmd called FCID 0x5205f2, LUN 0x2 TAG 52 flags 3
Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: CBD Opcode: 28 Abort issued time: 30067 msec
Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0x52 sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Returning from abort cmd type 2 FAILED
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Device reset called FCID 0x5205f2, LUN 0x2 sc: 00000000fab117f4
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Device reset allocation failed with -11
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_reset called
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x52061b
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205f2
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205cb
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205ca
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: update_mac 00:25:b5:cc:aa:00
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Issued fw reset
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: set port_id 0 fp 0000000000000000
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Returning from fnic reset SUCCESS
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_cleanup_io: tag:0x52 : sc:0x00000000fab117f4 duration = 52450 DID_TRANSPORT_DISRUPTED
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Calling done for IO not issued to fw: tag:0x52 sc:0x00000000fab117f4
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: reset cmpl success
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x52061b
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205f2
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205cb
Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205ca
Nov 17 15:27:54 rhel-c4s5 kernel: host7: Assigned Port ID 0d0880
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: set port_id d0880 fp 0000000067a66565
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: update_mac 0e:fc:00:0d:08:80
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: FLOGI reg issued fcid d0880 map 0 dest 8c:60:4f:95:ea:a4
Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: flog reg succeeded
Nov 17 15:28:06 rhel-c4s5 kernel: sd 7:0:3:2: Power-on or device reset occurred

Regards,
Karan




[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux