Hi Edoardo, Apologies for the delayed follow up on your bug report. Comments inline below. On Wed, 2016-05-04 at 10:38 +0200, Edoardo wrote: > Hi all, > I'm having troubles in my iSCSI target server after the update to > linux-4.5.2. I've always had some trouble, but not this bad. > This time the kernel simply crashes printing out “fixing recursive > fault but reboot is needed”, but reboot is actually the only thing I > can do. > Fortunately I was able to catch the dmesg output thanks to a remote > syslog server. > Can you help me sort this out? > I'm also testing on btrfs filesystem, so some troubles may come from there. > > I attached my saved targetcli configuration, and paste the info and > the dmesg output > > uname -a : > Linux gentoo-SMB1 4.5.2-gentoo #2 SMP Tue Apr 26 11:36:10 CEST 2016 > x86_64 Intel(R) Pentium(R) CPU G3220 @ 3.00GHz GenuineIntel GNU/Linux > In your lio_start.sh, all /sys/kernel/config/target/iscsi/$IQN/$TPGT/ endpoints are changing parameter defaults ErrorRecoveryLevel=0 to ErrorRecoveryLevel=2 for the two active iscsi-target exports. The reason we default to ERL=0 is because MSFT initiators have long had problems not following the ERL=2's connection recovery state machine in RFC-3720, resulting in hung scsi miniport I/O and other MSFT host side issues. Of course, the type of memory allocation failure you've observed below should not be triggering a target OOPsen, but for getting stable MSFT iSCSI host setup you really need to be using ERL=0 defaults for all exports. > [ 1154.103989] ignoring deprecated emulate_fua_read attribute > [ 1154.104021] ignoring deprecated emulate_dpo attribute > [ 1709.899686] Unable to locate ITT: 0xad891600 on CID: 1, dumping payload > [ 1709.899750] Unable to locate ITT: 0xae891600 on CID: 1, dumping payload > [ 1709.899792] Unable to locate ITT: 0xaf891600 on CID: 1, dumping payload > [ 1709.899841] Unable to locate ITT: 0xb0891600 on CID: 1, dumping payload > [ 1709.899856] Unable to locate ITT: 0xb1891600 on CID: 1, dumping payload > [ 1710.138608] Unable to locate ITT: 0xb2891600 on CID: 1, dumping payload > [ 1714.873446] Unable to locate ITT: 0x5b8a1600 on CID: 1, dumping payload > [ 1714.873513] Unable to locate ITT: 0x5c8a1600 on CID: 1, dumping payload > [ 1714.873644] Unable to locate ITT: 0x5d8a1600 on CID: 1, dumping payload > [ 1714.873689] Unable to locate ITT: 0x5e8a1600 on CID: 1, dumping payload > [ 1714.876817] Unable to locate ITT: 0x5f8a1600 on CID: 1, dumping payload > [ 1752.823610] Unable to locate ITT: 0xfd8e1600 on CID: 1, dumping payload > [ 1774.958443] Unable to locate ITT: 0x07911600 on CID: 1, dumping payload > [ 1774.960190] Unable to locate ITT: 0x08911600 on CID: 1, dumping payload > [ 1774.961896] Unable to locate ITT: 0x09911600 on CID: 1, dumping payload > [ 1774.963396] Unable to locate ITT: 0x0a911600 on CID: 1, dumping payload > [ 1774.965143] Unable to locate ITT: 0x0b911600 on CID: 1, dumping payload > [ 1774.966932] Unable to locate ITT: 0x0c911600 on CID: 1, dumping payload > [ 1774.968395] Unable to locate ITT: 0x0d911600 on CID: 1, dumping payload > [ 1864.857610] Unable to locate ITT: 0x54951600 on CID: 1, dumping payload > [ 1864.859108] Unable to locate ITT: 0x55951600 on CID: 1, dumping payload > [ 1868.897251] Unable to locate ITT: 0x73951600 on CID: 1, dumping payload > [ 1872.249432] Unable to locate ITT: 0x90951600 on CID: 1, dumping payload > [ 1872.249436] iscsi_trx: page allocation failure: order:3, mode:0x208c020 > [ 1872.249439] CPU: 0 PID: 5031 Comm: iscsi_trx Not tainted 4.5.2-gentoo #2 > [ 1872.249440] Hardware name: Dell Inc. PowerEdge T20/0VD5HY, BIOS A01 08/13/2013 <SNIP> > [ 1872.249556] Unable to allocate 32768 bytes for offload buffer. > [ 1872.249559] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249560] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249561] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249563] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249564] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249565] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249566] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249567] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249568] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249569] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249570] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249571] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249573] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249574] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249576] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249577] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249578] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249579] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249583] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249584] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249585] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249586] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249587] NopOUT Flag's, Left Most Bit not set, protocol error. > [ 1872.249588] Got unknown iSCSI OpCode: 0x5b > [ 1872.249590] Unable to recover from unknown opcode while OFMarker=No, closing iSCSI connection. > [ 1903.106918] Unable to locate ITT: 0xea961600 on CID: 1, dumping payload > [ 1903.106974] Unable to locate ITT: 0xeb961600 on CID: 1, dumping payload > [ 1903.107017] Unable to locate ITT: 0xec961600 on CID: 1, dumping payload > [ 1903.107054] Unable to locate ITT: 0xed961600 on CID: 1, dumping payload > [ 1915.539126] Unable to locate ITT: 0x90971600 on CID: 1, dumping payload > [ 1915.539132] iscsi_trx: page allocation failure: order:3, mode:0x208c020 > [ 1915.539135] CPU: 0 PID: 5284 Comm: iscsi_trx Not tainted 4.5.2-gentoo #2 > [ 1915.539136] Hardware name: Dell Inc. PowerEdge T20/0VD5HY, BIOS A01 08/13/2013 <SNIP> > [ 1915.539269] Unable to allocate 32768 bytes for offload buffer. It's strange that smallish order:3 memory allocations begin to fail this early.. > [ 1915.539271] Got unknown iSCSI OpCode: 0x33 > [ 1915.539272] Unable to recover from unknown opcode while OFMarker=No, closing iSCSI connection. > [ 1915.547337] BUG: unable to handle kernel paging request at ffffc900017e4100 > [ 1915.547478] IP: [<ffffffffa03cfbf4>] iscsit_free_connection_recovery_entires+0x1d9/0x278 [iscsi_target_mod] > [ 1915.547592] PGD 21608a067 PUD 21608b067 PMD d9520067 PTE 0 > [ 1915.547805] Oops: 0000 [#1] SMP > [ 1915.547933] Modules linked in: tcm_loop iscsi_target_mod > target_core_pscsi target_core_file target_core_iblock target_core_mod > kvm_intel kvm irqbypass crc32c_intel e1000e [last unloaded: > target_core_mod] > [ 1915.548539] CPU: 1 PID: 4150 Comm: kworker/1:1 Not tainted 4.5.2-gentoo #2 > [ 1915.548597] Hardware name: Dell Inc. PowerEdge T20/0VD5HY, BIOS A01 08/13/2013 > [ 1915.548662] Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod] > [ 1915.548759] task: ffff8801a6d1bf00 ti: ffff8801d95d4000 task.ti: ffff8801d95d4000 > [ 1915.548817] RIP: 0010:[<ffffffffa03cfbf4>] [<ffffffffa03cfbf4>] iscsit_free_connection_recovery_entires+0x1d9/0x278 [iscsi_target_mod] > [ 1915.548933] RSP: 0018:ffff8801d95d7c98 EFLAGS: 00010246 > [ 1915.548987] RAX: ffff8800b68ad858 RBX: ffff8801c96e3000 RCX: ffffc900017e4100 > [ 1915.549043] RDX: 0000000000000001 RSI: 00000001800c000b RDI: ffff8800b68ad868 > [ 1915.549100] RBP: ffff8801c96e3070 R08: 0000000000000001 R09: ffffffffa03db052 > [ 1915.549197] R10: ffffea0000b54000 R11: 0000000000100001 R12: ffff8801c96e3110 > [ 1915.549294] R13: ffff8800b68ad868 R14: ffff8800b68ad840 R15: ffffc900017e3f00 > [ 1915.549392] FS: 0000000000000000(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000 > [ 1915.549531] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 1915.549625] CR2: ffffc900017e4100 CR3: 00000000c0ad1000 CR4: 00000000000406a0 > [ 1915.549721] Stack: > [ 1915.549806] ffff8800b68ad858 ffff8801c96e30f8 ffff880211a96000 ffff8801c96e3000 > [ 1915.550103] ffff880061b4f050 ffff880061b4f000 ffff880061b4f050 ffff880214ec99c0 > [ 1915.550398] ffff880211a96000 ffffffffa03db06a ffffffffa038845c ffff88002d500a74 > [ 1915.550695] Call Trace: > [ 1915.550788] [<ffffffffa03db06a>] ? iscsit_close_session+0xe5/0x164 [iscsi_target_mod] > [ 1915.550936] [<ffffffffa038845c>] ? atomic_dec_mb+0x4/0x4 [target_core_mod] > [ 1915.551039] [<ffffffffa0388bf5>] ? kref_put+0x2e/0x36 [target_core_mod] > [ 1915.551142] [<ffffffffa03d0676>] ? iscsi_check_for_session_reinstatement+0x1bf/0x1d0 [iscsi_target_mod] > [ 1915.551290] [<ffffffffa03d24d4>] ? iscsi_target_do_login+0x2f9/0x4e8 [iscsi_target_mod] > [ 1915.551434] [<ffffffffa03d315f>] ? iscsi_target_do_login_rx+0x165/0x1e9 [iscsi_target_mod] > [ 1915.551578] [<ffffffffa03d1e67>] ? iscsi_target_restore_sock_callbacks+0x8a/0x8a [iscsi_target_mod] > [ 1915.551722] [<ffffffff8104c608>] ? process_one_work+0x19c/0x2b9 > [ 1915.551818] [<ffffffff8104ca5d>] ? worker_thread+0x1d9/0x2ad > [ 1915.551912] [<ffffffff8104c884>] ? cancel_delayed_work_sync+0xa/0xa > [ 1915.552010] [<ffffffff810507b7>] ? kthread+0x95/0x9d > [ 1915.552104] [<ffffffff81050722>] ? kthread_parkme+0x16/0x16 > [ 1915.552204] [<ffffffff81538d5f>] ? ret_from_fork+0x3f/0x70 > [ 1915.552293] [<ffffffff81050722>] ? kthread_parkme+0x16/0x16 > [ 1915.552382] Code: 86 90 00 00 00 c6 83 10 01 00 00 00 4d 8d 6e 28 > 4c 89 ef e8 2e 8c 16 e1 49 8b 4e 18 49 8d 46 18 48 89 04 24 4c 8d b9 > 00 fe ff ff <48> 8b 09 48 81 e9 00 02 00 00 49 8d bf 00 02 00 00 48 3b > 3c 24 > [ 1915.555105] RIP [<ffffffffa03cfbf4>] iscsit_free_connection_recovery_entires+0x1d9/0x278 [iscsi_target_mod] > [ 1915.555283] RSP <ffff8801d95d7c98> > [ 1915.555367] CR2: ffffc900017e4100 > [ 1915.555451] ---[ end trace 405a6f5266a8bf99 ]--- > Ok, so after dumping repeated TCP payloads and two memory allocation failures, iscsi login eventually hits a kernel paging OOPs after subsequent ERL=2 connection login failure occurs. I still need to ponder the scenario for a proper bug-fix might work, but the issue itself is ERL=2 specific and does not effect ERL=0 operation. So as a work-around, go ahead and change to ERL=0 in your lio_start.sh for both IQN+TargetPortalGroupTag endpoints. Note you'll need to force reconnect all MSFT sessions after changing this configfs attribute for the new values to take effect. > > Thanks for the help, > -- > Edoardo Liverani > -- Thanks for reporting! -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html