On Wednesday, November 15, 2023 4:22 PM, Karan Tilak Kumar (kartilak) wrote: > > On Tuesday, November 14, 2023 12:02 AM, Hannes Reinecke <hare@xxxxxxx> wrote: > > > I integrated your patch set using "b4 shazam" and built all the changes to do some dev testing. > > > I instrumented the code to do the following: > > > > > > - After one million IOs, drop one IO response. > > > - This will trigger an abort. Drop that abort response. > > > - This will trigger a device reset. I'm seeing that the tag here is 0xFFFFFFFF (SCSI_NO_TAG). > > > > > > This tag fails the out-of-range tag check and escalates to host reset. > > > I've been able to repro this three times with the same result. > > > > > > The expectation with this test case is that the device reset should go through successfully without escalating to host reset. > > > > > Thanks for the report. > > Certainly not what was intended with the patch. Can you try with this patch on top: > > I applied this patch on top of the previous one and performed the unit tests that I had mentioned before. > > > > > (The first hunk is actually a bugfix, and will be included with the next submission anyway.) Please enable fnic debug during testing and send me the logs. > > It _might_ be that this is actually by design, as we will be failing if all tags are busy (ie if the error is EWOULDBLOCK). > > But if we get EAGAIN it means that the queue is blocked and we'll need to investigate. > > There were two attempts at reproduction. One of them resulted in a crash. I have enabled fnic debug and captured the log. > I've pasted the excerpt here for reference. > > Repro 1: > > Nov 15 16:08:14 localhost kernel: fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0x91 sc: 00000000bbe9fd78 CDB Opcode: 0x28 Dropping icmnd completion > Nov 15 16:08:35 localhost multipathd[1746]: burst continued 30025 ms, too long time, stopped > Nov 15 16:08:36 localhost kernel: sched: RT throttling activated > Nov 15 16:08:44 localhost kernel: scsi host7: Abort Cmd called FCID 0x5205fa, LUN 0x1 TAG 91 flags 3 > Nov 15 16:08:44 localhost kernel: scsi host7: CBD Opcode: 28 Abort issued time: 30036 msec > Nov 15 16:08:44 localhost kernel: scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0x91 sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion > Nov 15 16:09:05 localhost multipathd[1746]: burst continued 30041 ms, too long time, stopped > Nov 15 16:09:06 localhost kernel: scsi host7: Returning from abort cmd type 2 FAILED > Nov 15 16:09:06 localhost kernel: scsi host7: Device reset called FCID 0x5205fa, LUN 0x1 sc: 00000000bbe9fd78 > Nov 15 16:09:06 localhost kernel: scsi host7: TAG ffffffff sc: 000000005ee1ff69 > Nov 15 16:09:06 localhost kernel: scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_SUCCESS > Nov 15 16:09:17 localhost kernel: scsi host7: Device reset timed out > Nov 15 16:09:17 localhost kernel: scsi host7: Abort and terminate issued on Device reset tag 0xffffffff > Nov 15 16:09:17 localhost kernel: scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_IO_NOT_FOUND > Nov 15 16:09:27 localhost kernel: scsi host7: Device reset completed - failed > Nov 15 16:09:27 localhost kernel: scsi host7: Returning from device reset FAILED > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_reset called > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x52060c > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fb > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fa > Nov 15 16:09:27 localhost kernel: scsi host7: update_mac 00:25:b5:bb:bb:a0 > Nov 15 16:09:27 localhost kernel: scsi host7: Issued fw reset > Nov 15 16:09:27 localhost kernel: scsi host7: set port_id 0 fp 0000000000000000 > Nov 15 16:09:27 localhost kernel: scsi host7: Returning from fnic reset SUCCESS > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x52060c > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fb > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_cleanup_io: tag:0x91 : sc:0x00000000bbe9fd78 duration = 72835 DID_TRANSPORT_DISRUPTED > Nov 15 16:09:27 localhost kernel: scsi host7: Calling done for IO not issued to fw: tag:0x91 sc:0x00000000bbe9fd78 > Nov 15 16:09:27 localhost kernel: scsi host7: reset cmpl success > Nov 15 16:09:27 localhost kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205fa > Nov 15 16:09:29 localhost kernel: host7: Assigned Port ID 0d0840 > Nov 15 16:09:29 localhost kernel: scsi host7: set port_id d0840 fp 0000000046a33a9d > Nov 15 16:09:29 localhost kernel: scsi host7: update_mac 0e:fc:00:0d:08:40 > Nov 15 16:09:29 localhost kernel: scsi host7: FLOGI reg issued fcid d0840 map 0 dest 8c:60:4f:95:ea:a4 > Nov 15 16:09:29 localhost kernel: scsi host7: flog reg succeeded > Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: sdb - tur checker timed out > Nov 15 16:09:38 localhost multipathd[1746]: checker failed path 8:16 in map 3600a098038303873633f4d415156374d > Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: remaining active paths: 1 > Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: sde - tur checker timed out > Nov 15 16:09:38 localhost multipathd[1746]: checker failed path 8:64 in map 3600a098038303873633f4d415156374e > Nov 15 16:09:38 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: remaining active paths: 1 > Nov 15 16:09:38 localhost multipathd[1746]: sdb: mark as failed > Nov 15 16:09:38 localhost multipathd[1746]: sde: mark as failed > Nov 15 16:09:38 localhost kernel: device-mapper: multipath: 253:3: Failing path 8:16. > Nov 15 16:09:38 localhost kernel: device-mapper: multipath: 253:4: Failing path 8:64. > Nov 15 16:09:40 localhost kernel: sd 7:0:2:1: Power-on or device reset occurred > Nov 15 16:09:40 localhost kernel: sd 7:0:1:1: Power-on or device reset occurred > Nov 15 16:09:40 localhost kernel: sd 7:0:1:0: Power-on or device reset occurred > Nov 15 16:09:40 localhost kernel: sd 7:0:2:0: Power-on or device reset occurred > Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: sdb - tur checker reports path is up > Nov 15 16:09:43 localhost multipathd[1746]: 8:16: reinstated > Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374d: remaining active paths: 2 > Nov 15 16:09:43 localhost kernel: device-mapper: multipath: 253:3: Reinstating path 8:16. > Nov 15 16:09:43 localhost kernel: device-mapper: multipath: 253:4: Reinstating path 8:64. > Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: sde - tur checker reports path is up > Nov 15 16:09:43 localhost multipathd[1746]: 8:64: reinstated > Nov 15 16:09:43 localhost multipathd[1746]: 3600a098038303873633f4d415156374e: remaining active paths: 2 > > ============= > Repro 2: > > [ 184.517949] fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0xeb sc: 000000000d852f23 CDB Opcode: 0x28 Dropping icmnd completion > [ 214.523460] scsi host7: Abort Cmd called FCID 0x5205fb, LUN 0x1 TAG eb flags 3 > [ 214.523474] scsi host7: CBD Opcode: 28 Abort issued time: 30026 msec > [ 214.523548] scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0xeb sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion > [ 236.835184] scsi host7: Returning from abort cmd type 2 FAILED > [ 236.943669] scsi host7: Device reset called FCID 0x5205fb, LUN 0x1 sc: 000000000d852f23 > [ 236.943688] scsi host7: TAG ffffffff sc: 00000000ea33dc31 > [ 236.956270] scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_SUCCESS > [ 247.074054] scsi host7: Device reset timed out > [ 247.074063] scsi host7: Abort and terminate issued on Device reset tag 0xffffffff > [ 247.074070] scsi host7: Tag out of range tag ffffffff hdr status = FCPIO_IO_NOT_FOUND > [ 257.312999] scsi host7: Device reset completed - failed > [ 257.313008] BUG: kernel NULL pointer dereference, address: 000000000000001c > [ 257.313016] #PF: supervisor read access in kernel mode > [ 257.313021] #PF: error_code(0x0000) - not-present page > [ 257.313028] PGD 80000010c746c067 P4D 80000010c746c067 PUD 10c746d067 PMD 0 > [ 257.313041] Oops: 0000 [#1] PREEMPT SMP PTI > [ 257.313049] CPU: 0 PID: 1114 Comm: scsi_eh_7 Kdump: loaded Not tainted 6.4.7 #5 > [ 257.313058] Hardware name: Cisco Systems Inc UCSB-B200-M5/UCSB-B200-M5, BIOS B200M5.4.0.2a.0.1102181538 11/02/2018 > [ 257.313063] RIP: 0010:dma_direct_unmap_sg+0x58/0x1d0 > [ 257.313080] Code: 89 f5 89 d3 4d 89 c5 45 31 ff eb 1e 83 e0 fe 89 45 1c 48 89 ef 41 83 c7 01 e8 b4 b9 44 00 48 89 c5 44 39 fb 0f 84 39 01 00 00 <8b> 45 1c a8 01 75 db 4c 8b 05 f2 1b 68 01 49 8b bc 24 58 02 00 00 > [ 257.313087] RSP: 0018:ffff9cfa49157cc0 EFLAGS: 00010246 > [ 257.313094] RAX: 0000000000000000 RBX: 0000000000000003 RCX: 0000000000000002 > [ 257.313099] RDX: 0000000000000003 RSI: 0000000000000000 RDI: ffff8fab45f470d0 > [ 257.313103] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffff7fff > [ 257.313107] R10: ffff9cfa49157b18 R11: ffffffff969e5d28 R12: ffff8fab45f470d0 > [ 257.313112] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 > [ 257.313116] FS: 0000000000000000(0000) GS:ffff8faaff800000(0000) knlGS:0000000000000000 > [ 257.313122] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 257.313128] CR2: 000000000000001c CR3: 00000010c874c002 CR4: 00000000007706f0 > [ 257.313133] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 257.313137] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > [ 257.313141] PKRU: 55555554 > [ 257.313145] Call Trace: > [ 257.313150] <TASK> > [ 257.313158] ? __die+0x24/0x70 > [ 257.313171] ? page_fault_oops+0x82/0x150 > [ 257.313181] ? irq_work_queue+0x32/0x60 > [ 257.313190] ? vprintk_emit+0x100/0x230 > [ 257.313200] ? exc_page_fault+0x69/0x150 > [ 257.313211] ? asm_exc_page_fault+0x26/0x30 > [ 257.313228] ? dma_direct_unmap_sg+0x58/0x1d0 > [ 257.313244] fnic_release_ioreq_buf+0x23/0xc0 [fnic] > [ 257.313274] fnic_device_reset+0x57c/0x860 [fnic] > [ 257.313297] ? sched_clock_cpu+0xf/0x190 > [ 257.313308] ? raw_spin_rq_lock_nested+0x1d/0x80 > [ 257.313318] ? newidle_balance+0x26e/0x400 > [ 257.313329] scsi_eh_bus_device_reset+0xee/0x2e0 > [ 257.313343] scsi_eh_ready_devs+0x6e/0x2f0 > [ 257.313366] ? rpm_resume+0x411/0x750 > [ 257.313372] scsi_unjam_host+0x10a/0x200 > [ 257.313377] scsi_error_handler+0x144/0x1d0 > [ 257.313380] ? __pfx_scsi_error_handler+0x10/0x10 > [ 257.313383] kthread+0xe3/0x120 > [ 257.313386] ? __pfx_kthread+0x10/0x10 > [ 257.313388] ret_from_fork+0x29/0x50 > [ 257.313394] </TASK> > [ 257.313395] Modules linked in: dm_round_robin snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore nft_compat nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables rfkill nfnetlink qrtr dm_multipath intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass rapl intel_cstate joydev mei_me intel_uncore ioatdma mei intel_pch_thermal ipmi_ssif pcspkr dca lpc_ich acpi_ipmi ipmi_si acpi_pad acpi_power_meter xfs libcrc32c mgag200 sd_mod t10_pi drm_kms_helper crc64_rocksoft_generic crc64_rocksoft crc64 sg fnic syscopyarea sysfillrect sysimgblt i2c_algo_bit libfcoe drm_shmem_helper crct10dif_pclmul ahci crc32_pclmul crc32c_intel libahci libfc ghash_clmulni_intel drm libata > enic scsi_transport_fc sha512_ssse3 megaraid_sas wmi dm_mirror > [ 257.313470] dm_region_hash dm_log dm_mod ipmi_devintf ipmi_msghandler fuse > [ 257.313478] CR2: 000000000000001c > > > Regards, > Karan > Repro 3: Nov 17 15:27:00 rhel-c4s5 kernel: fnic<7>: UT: fnic_fcpio_icmnd_cmpl_handler: 847: tag: 0x52 sc: 00000000fab117f4 CDB Opcode: 0x28 Dropping icmnd completion Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: Abort Cmd called FCID 0x5205f2, LUN 0x2 TAG 52 flags 3 Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: CBD Opcode: 28 Abort issued time: 30067 msec Nov 17 15:27:30 rhel-c4s5 kernel: scsi host7: fnic<7>: UT: fnic_fcpio_itmf_cmpl_handler: 1113: tag: 0x52 sc: 00 status: FCPIO_IO_NOT_FOUND Dropping abort completion Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Returning from abort cmd type 2 FAILED Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Device reset called FCID 0x5205f2, LUN 0x2 sc: 00000000fab117f4 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Device reset allocation failed with -11 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_reset called Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x52061b Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205f2 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205cb Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205ca Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: update_mac 00:25:b5:cc:aa:00 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Issued fw reset Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: set port_id 0 fp 0000000000000000 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Returning from fnic reset SUCCESS Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0xfffffc Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_cleanup_io: tag:0x52 : sc:0x00000000fab117f4 duration = 52450 DID_TRANSPORT_DISRUPTED Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: Calling done for IO not issued to fw: tag:0x52 sc:0x00000000fab117f4 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: reset cmpl success Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x52061b Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205f2 Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205cb Nov 17 15:27:52 rhel-c4s5 kernel: scsi host7: fnic_rport_exch_reset called portid 0x5205ca Nov 17 15:27:54 rhel-c4s5 kernel: host7: Assigned Port ID 0d0880 Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: set port_id d0880 fp 0000000067a66565 Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: update_mac 0e:fc:00:0d:08:80 Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: FLOGI reg issued fcid d0880 map 0 dest 8c:60:4f:95:ea:a4 Nov 17 15:27:54 rhel-c4s5 kernel: scsi host7: flog reg succeeded Nov 17 15:28:06 rhel-c4s5 kernel: sd 7:0:3:2: Power-on or device reset occurred Regards, Karan