On Tue, 2023-06-27 at 12:29 -0400, Laurence Oberman wrote:
> Hello
>
> A customer discovered this on a RHEL 8.8 kernel but the issue also
> exists upstream with the current code in 6.4 for example.
>
> [ 177.143279] ? qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
> [ 177.149165] ? internal_add_timer+0x42/0x70
> [ 177.153372] qla2xxx_mqueuecommand+0x207/0x2b0 [qla2xxx]
> [ 177.158730] scsi_queue_rq+0x2b7/0xc00
> [ 177.162501] blk_mq_dispatch_rq_list+0x3ea/0x7e0
>
> Simple reproducer to a LUN with no protection
> sg_write_same -T --lba=1 /dev/sdxx (or mpath)
>
> With the device having no protection we land up with
> SCSI_PROT_NORMAL being used so fall through to the BUG()
>
> 	switch (scsi_get_prot_op(GET_CMD_SP(sp))) {
> 	case SCSI_PROT_READ_INSERT:
> 	case SCSI_PROT_WRITE_STRIP:
> 		total_bytes = data_bytes;
> 		data_bytes += dif_bytes;
> 		break;
>
> 	case SCSI_PROT_READ_STRIP:
> 	case SCSI_PROT_WRITE_INSERT:
> 	case SCSI_PROT_READ_PASS:
> 	case SCSI_PROT_WRITE_PASS:
> 		total_bytes = data_bytes + dif_bytes;
> 		break;
> 	default:
> 		BUG();
> 	}
>
>
> I also had David Jeffery look at this and his comment was
>
> In this particular case, it looks like the issue is just with qla2xxx,
> regardless of the hardware. The scsi_disk being sent the command had no
> dif protection enabled and there was no dix data.
>
> crash> struct scsi_disk.protection_type 0xff34947432176800
>   protection_type = 0 '\000',
>
> crash> px ((struct scsi_cmnd *)0xff3494740b759138)->prot_sdb[0]
> $7 = {
>   table = {
>     sgl = 0xff3494740b7595a8,
>     nents = 0x0,
>     orig_nents = 0x0
>   },
>   length = 0x0,
>   resid = 0x0
> }
>
> So a WRITE_SAME_32 prot_op was always going to be SCSI_PROT_NORMAL in
> prot_op. qla2xxx should not crash when passed such a command and
> state.
>
>
> KDUMP
> Linux
> segstorage3
> 6.4.0+
>
> [ 176.960932] ------------[ cut here ]------------
> [ 176.965582] kernel BUG at drivers/scsi/qla2xxx/qla_iocb.c:1459!
> [ 176.971540] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
> [ 176.976795] CPU: 10 PID: 16058 Comm: sg_write_same Kdump: loaded Tainted: G S 6.4.0+ #1
> [ 176.986240] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 05/17/2022
> [ 176.994812] RIP: 0010:qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
> [ 177.001337] Code: ff ff 48 8b 7c 24 40 0f b7 bf 4c 01 00 00 e9 73 f6 ff ff 83 3d 68 a0 de ff 01 0f 8e 7b fd ff ff e9 6f fd ff ff e8 b8 7f 07 ce <0f> 0b 66 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90
> [ 177.020217] RSP: 0018:ffffa1c44f86b9e0 EFLAGS: 00010046
> [ 177.025470] RAX: 0000000000000008 RBX: ffff961087e29000 RCX: 0000000000000000
> [ 177.032644] RDX: 0000000000000000 RSI: ffff9617c9e09460 RDI: 0000000000000200
> [ 177.039818] RBP: ffff9617c9e09588 R08: ffff9617c9e09460 R09: 0000000000000200
> [ 177.046992] R10: ffff96107800e880 R11: 0000000000000000 R12: 00000000000010c0
> [ 177.054165] R13: ffff96107800e880 R14: ffff961064c52180 R15: ffff961066f8de00
> [ 177.061337] FS: 00007f41eef7e740(0000) GS:ffff961f4d800000(0000) knlGS:0000000000000000
> [ 177.069471] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 177.075246] CR2: 000055e1e2591bd8 CR3: 00000008823b2005 CR4: 00000000007706e0
> [ 177.082420] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 177.089594] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 177.096768] PKRU: 55555554
> [ 177.099487] Call Trace:
> [ 177.101944] <TASK>
> [ 177.104052] ? __die_body+0x1e/0x60
> [ 177.107560] ? die+0x3c/0x60
> [ 177.110454] ? do_trap+0xe6/0x110
> [ 177.113786] ? qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
> [ 177.119674] ? do_error_trap+0x65/0x80
> [ 177.123442] ? qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
> [ 177.129328] ? exc_invalid_op+0x50/0x70
> [ 177.133184] ? qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
> [ 177.139071] ? asm_exc_invalid_op+0x1a/0x20
> [ 177.143279] ? qla2xxx_dif_start_scsi_mq+0xcd8/0xce0 [qla2xxx]
> [ 177.149165] ? internal_add_timer+0x42/0x70
> [ 177.153372] qla2xxx_mqueuecommand+0x207/0x2b0 [qla2xxx]
> [ 177.158730] scsi_queue_rq+0x2b7/0xc00
> [ 177.162501] blk_mq_dispatch_rq_list+0x3ea/0x7e0
> [ 177.167143] __blk_mq_sched_dispatch_requests+0xac/0x670
> [ 177.172485] ? blk_rq_map_user_iov+0x2ae/0x690
> [ 177.176952] ? blk_mq_request_bypass_insert+0x74/0xa0
> [ 177.182031] blk_mq_sched_dispatch_requests+0x37/0x70
> [ 177.187110] blk_mq_run_hw_queue+0x183/0x1b0
> [ 177.191402] blk_execute_rq+0x103/0x230
> [ 177.195257] sg_io+0x17f/0x360
> [ 177.198327] scsi_ioctl_sg_io+0x69/0x90
> [ 177.202182] scsi_ioctl+0x4c6/0x890
> [ 177.205688] ? scsi_block_when_processing_errors+0x26/0xd0
> [ 177.211204] ? multipath_prepare_ioctl+0x50/0x130 [dm_multipath]
> [ 177.217247] dm_blk_ioctl+0x72/0x120 [dm_mod]
> [ 177.221637] blkdev_ioctl+0x1c2/0x280
> [ 177.225320] __x64_sys_ioctl+0x90/0xd0
> [ 177.229089] do_syscall_64+0x3b/0x90
> [ 177.232683] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> [ 177.237762] RIP: 0033:0x7f41ee4397cb
> [ 177.241355] Code: 73 01 c3 48 8b 0d bd 56 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 56 38 00 f7 d8 64 89 01 48
> [ 177.260234] RSP: 002b:00007ffe44cf3578 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [ 177.267846] RAX: ffffffffffffffda RBX: 000055e1e25909a0 RCX: 00007f41ee4397cb
> [ 177.275018] RDX: 00007ffe44cf3580 RSI: 0000000000002285 RDI: 0000000000000003
> [ 177.282191] RBP: 0000000000000003 R08: 0000000000000040 R09: 000055e1e2590a50
> [ 177.289363] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> [ 177.296535] R13: 00007ffe44cf3638 R14: 000055e1e25909a0 R15: 00007ffe44cf3890

Hello Nilesh,

This is not a final patch and will need cleanup, but it is something I
came up with that prevents the panic. You probably have better ideas.
I have not signed it, as it's just a suggestion.

[PATCH] scsi: qla2xxx: avoid a panic due to BUG() if a command is sent
to a device that has no protection

If a device does not have protection, qla2xxx will land up defaulting
to a BUG() and a system panic. This is because SCSI_PROT_NORMAL is not
handled by any case and falls through to the default, which is BUG().
This patch handles SCSI_PROT_NORMAL explicitly and prints a WARN
instead.

diff --git a/drivers/scsi/qla2xxx/qla_iocb.c b/drivers/scsi/qla2xxx/qla_iocb.c
index b9b3e6f80ea9..3fca7c7b7a92 100644
--- a/drivers/scsi/qla2xxx/qla_iocb.c
+++ b/drivers/scsi/qla2xxx/qla_iocb.c
@@ -1443,6 +1443,12 @@ qla24xx_build_scsi_crc_2_iocbs(srb_t *sp, struct cmd_type_crc_2 *cmd_pkt,
 	dif_bytes = (data_bytes / blk_size) * 8;
 
 	switch (scsi_get_prot_op(GET_CMD_SP(sp))) {
+	case SCSI_PROT_NORMAL:
+		total_bytes = data_bytes;
+		WARN(1, "device has no protection, command sent expecting\
+			DIF or DIX protection with proto_op=%d",
+			cmd->prot_op);
+		break;
 	case SCSI_PROT_READ_INSERT:
 	case SCSI_PROT_WRITE_STRIP:
 		total_bytes = data_bytes;

sg_write_same -T --lba=1 /dev/mapper/mpathz1

[root@segstorage3 ~]# sg_write_same -T --lba=1 /dev/mapper/mpathz1
Write same: transport: Host_status=0x07 [DID_ERROR]
Driver_status=0x00 [DRIVER_OK]

Write same(32): Sense category: -1, try '-v' option for more information
Some error occurred, try again with '-v' or '-vv' for more information

segstorage3 login: [ 785.431935] ------------[ cut here ]------------
[ 785.436586] device has no protection, command sent expecting DIF or DIX protection with proto_op=0
[ 785.436635] WARNING: CPU: 39 PID: 20588 at drivers/scsi/qla2xxx/qla_iocb.c:1450 qla2xxx_dif_start_scsi_mq+0x4b4/0xd40 [qla2xxx]
[ 785.534337] CPU: 39 PID: 20588 Comm: sg_write_same Kdump: loaded Tainted: G S W 6.4.0+ #1
[ 785.543782] Hardware name: HPE ProLiant DL380 Gen10/ProLiant DL380 Gen10, BIOS U30 05/17/2022
[ 785.552353] RIP: 0010:qla2xxx_dif_start_scsi_mq+0x4b4/0xd40 [qla2xxx]
[ 785.558853] Code: b6 b0 98 00 00 00 48 c7 c7 e0 e9 79 c0 44 89 5c 24 74 89 44 24 70 44 89 4c 24 6c 4c 89 54 24 60 4c 89 44 24 50 e8 dc 67 9e c0 <0f> 0b 48 8b b5 98 00 00 00 44 8b 4c 24 6c 4c 8b 44 24 50 4c 8b 54
[ 785.577731] RSP: 0018:ffffc9000d527988 EFLAGS: 00010086
[ 785.582985] RAX: 0000000000000000 RBX: ffff8881160c6000 RCX: 0000000000000027
[ 785.590160] RDX: 0000000000000027 RSI: 00000000ffdfffff RDI: ffff88900dae0848
[ 785.597333] RBP: ffff88884be0f948 R08: 0000000000000000 R09: c0000000ffdfffff
[ 785.604506] R10: 0000000000000001 R11: ffffc9000d527820 R12: 0000000000001c68
[ 785.611680] R13: ffff8881539d8d80 R14: ffff88813370eb40 R15: ffff888105910800
[ 785.618853] FS: 00007fd53861a740(0000) GS:ffff88900dac0000(0000) knlGS:0000000000000000
[ 785.626987] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 785.632762] CR2: 0000556d74bfabd8 CR3: 000000087696c003 CR4: 00000000007706e0
[ 785.639936] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 785.647110] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 785.654283] PKRU: 55555554
[ 785.657002] Call Trace:
[ 785.659461] <TASK>
[ 785.661571] ? __warn+0x85/0x140
[ 785.664819] ? qla2xxx_dif_start_scsi_mq+0x4b4/0xd40 [qla2xxx]
[ 785.670709] ? report_bug+0xfc/0x1e0
[ 785.674306] ? handle_bug+0x3f/0x70
[ 785.677815] ? exc_invalid_op+0x17/0x70
[ 785.681669] ? asm_exc_invalid_op+0x1a/0x20
[ 785.685880] ? qla2xxx_dif_start_scsi_mq+0x4b4/0xd40 [qla2xxx]
[ 785.691767] ? qla2xxx_dif_start_scsi_mq+0x4b4/0xd40 [qla2xxx]
[ 785.697651] qla2xxx_mqueuecommand+0x207/0x2b0 [qla2xxx]
[ 785.703007] scsi_queue_rq+0x2b7/0xc00
[ 785.706781] blk_mq_dispatch_rq_list+0x3ea/0x7e0
[ 785.711426] __blk_mq_sched_dispatch_requests+0xac/0x670
[ 785.716770] ? blk_rq_map_user_iov+0x2ae/0x690
[ 785.721238] ? blk_mq_request_bypass_insert+0x74/0xa0
[ 785.726317] blk_mq_sched_dispatch_requests+0x37/0x70
[ 785.731395] blk_mq_run_hw_queue+0x183/0x1b0
[ 785.735688] blk_execute_rq+0x103/0x230
[ 785.739545] sg_io+0x17f/0x360
[ 785.742614] scsi_ioctl_sg_io+0x69/0x90
[ 785.746470] scsi_ioctl+0x4c6/0x890
[ 785.749974] ? scsi_block_when_processing_errors+0x26/0xd0
[ 785.755489] ? multipath_prepare_ioctl+0x50/0x130 [dm_multipath]
[ 785.761531] dm_blk_ioctl+0x72/0x120 [dm_mod]
[ 785.765925] dm_blk_ioctl+0x72/0x120 [dm_mod]
[ 785.770312] blkdev_ioctl+0x1c2/0x280
[ 785.773995] __x64_sys_ioctl+0x90/0xd0
[ 785.777767] do_syscall_64+0x3b/0x90
[ 785.781360] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[ 785.786440] RIP: 0033:0x7fd537a397cb
[ 785.790034] Code: 73 01 c3 48 8b 0d bd 56 38 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8d 56 38 00 f7 d8 64 89 01 48
[ 785.808912] RSP: 002b:00007ffdef6ef068 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[ 785.816524] RAX: ffffffffffffffda RBX: 0000556d74bf99a0 RCX: 00007fd537a397cb
[ 785.823699] RDX: 00007ffdef6ef070 RSI: 0000000000002285 RDI: 0000000000000003
[ 785.830873] RBP: 0000000000000003 R08: 0000000000000040 R09: 0000556d74bf9a50
[ 785.838046] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 785.845221] R13: 00007ffdef6ef128 R14: 0000556d74bf99a0 R15: 00007ffdef6ef380
[ 785.852396] </TASK>
[ 785.854590] ---[ end trace 0000000000000000 ]---
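
For anyone following the byte accounting, here is a small userspace
model of the switch with the SCSI_PROT_NORMAL case from the patch
added. It is only a sketch, not driver code: the MODEL_* enum and
model_crc2_byte_counts() are made-up names for illustration, and
returning -1 from the default case in place of BUG() is an assumption
about one way a caller could fail the command rather than crash.

```c
#include <stdio.h>

/*
 * Userspace stand-ins for the kernel's SCSI_PROT_* values. The MODEL_
 * names are hypothetical and exist only for this sketch.
 */
enum model_prot_op {
	MODEL_PROT_NORMAL = 0,
	MODEL_PROT_READ_INSERT,
	MODEL_PROT_WRITE_STRIP,
	MODEL_PROT_READ_STRIP,
	MODEL_PROT_WRITE_INSERT,
	MODEL_PROT_READ_PASS,
	MODEL_PROT_WRITE_PASS,
};

/*
 * Models the byte accounting of the switch: 8 bytes of protection data
 * per block. Returns 0 on success and -1 when the inputs cannot be
 * handled (the spot where the driver calls BUG()).
 */
static int model_crc2_byte_counts(enum model_prot_op op, unsigned int blk_size,
				  unsigned int *data_bytes,
				  unsigned int *total_bytes)
{
	unsigned int dif_bytes;

	if (blk_size == 0)
		return -1;

	dif_bytes = (*data_bytes / blk_size) * 8;

	switch (op) {
	case MODEL_PROT_NORMAL:
		/* Same choice as the patch: no protection bytes, warn only. */
		*total_bytes = *data_bytes;
		fprintf(stderr, "no protection requested, prot_op=%d\n", (int)op);
		return 0;
	case MODEL_PROT_READ_INSERT:
	case MODEL_PROT_WRITE_STRIP:
		*total_bytes = *data_bytes;
		*data_bytes += dif_bytes;
		return 0;
	case MODEL_PROT_READ_STRIP:
	case MODEL_PROT_WRITE_INSERT:
	case MODEL_PROT_READ_PASS:
	case MODEL_PROT_WRITE_PASS:
		*total_bytes = *data_bytes + dif_bytes;
		return 0;
	default:
		/* Unknown operation: report an error instead of crashing. */
		return -1;
	}
}
```

With a 4096-byte transfer and 512-byte blocks, dif_bytes is 64, so the
WRITE_STRIP case yields total_bytes = 4096 and data_bytes = 4160, while
the NORMAL case leaves both at 4096.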