Re: Continuously crashes in kernel 4.5.2

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi Edoardo,

Apologies for the delayed follow up on your bug report.

Comments inline below.

On Wed, 2016-05-04 at 10:38 +0200, Edoardo wrote:
> Hi all,
> I'm having troubles in my iSCSI target server after the update to
> linux-4.5.2. I've always had some trouble, but not this bad.
> This time the kernel simply crashes printing out “fixing recursive
> fault but reboot is needed”, but reboot is actually the only thing I
> can do.
> Fortunately I was able to catch the dmesg output thanks to a remote
> syslog server.
> Can you help me sort this out?
> I'm also testing on btrfs filesystem, so some troubles may come from there.
> 
> I attached my saved targetcli configuration, and paste the info and
> the dmesg output
> 
> uname -a :
> Linux gentoo-SMB1 4.5.2-gentoo #2 SMP Tue Apr 26 11:36:10 CEST 2016
> x86_64 Intel(R) Pentium(R) CPU G3220 @ 3.00GHz GenuineIntel GNU/Linux
> 

In your lio_start.sh, all /sys/kernel/config/target/iscsi/$IQN/$TPGT/
endpoints are changing parameter defaults ErrorRecoveryLevel=0 to
ErrorRecoveryLevel=2 for the two active iscsi-target exports.

The reason we default to ERL=0 is because MSFT initiators have long had
problems not following the ERL=2's connection recovery state machine in
RFC-3720, resulting in hung scsi miniport I/O and other MSFT host side
issues.

Of course, the type of memory allocation failure you've observed below
should not be triggering a target OOPsen, but for getting stable
MSFT iSCSI host setup you really need to be using ERL=0 defaults for all
exports.

> [ 1154.103989] ignoring deprecated emulate_fua_read attribute
> [ 1154.104021] ignoring deprecated emulate_dpo attribute
> [ 1709.899686] Unable to locate ITT: 0xad891600 on CID: 1, dumping payload
> [ 1709.899750] Unable to locate ITT: 0xae891600 on CID: 1, dumping payload
> [ 1709.899792] Unable to locate ITT: 0xaf891600 on CID: 1, dumping payload
> [ 1709.899841] Unable to locate ITT: 0xb0891600 on CID: 1, dumping payload
> [ 1709.899856] Unable to locate ITT: 0xb1891600 on CID: 1, dumping payload
> [ 1710.138608] Unable to locate ITT: 0xb2891600 on CID: 1, dumping payload
> [ 1714.873446] Unable to locate ITT: 0x5b8a1600 on CID: 1, dumping payload
> [ 1714.873513] Unable to locate ITT: 0x5c8a1600 on CID: 1, dumping payload
> [ 1714.873644] Unable to locate ITT: 0x5d8a1600 on CID: 1, dumping payload
> [ 1714.873689] Unable to locate ITT: 0x5e8a1600 on CID: 1, dumping payload
> [ 1714.876817] Unable to locate ITT: 0x5f8a1600 on CID: 1, dumping payload
> [ 1752.823610] Unable to locate ITT: 0xfd8e1600 on CID: 1, dumping payload
> [ 1774.958443] Unable to locate ITT: 0x07911600 on CID: 1, dumping payload
> [ 1774.960190] Unable to locate ITT: 0x08911600 on CID: 1, dumping payload
> [ 1774.961896] Unable to locate ITT: 0x09911600 on CID: 1, dumping payload
> [ 1774.963396] Unable to locate ITT: 0x0a911600 on CID: 1, dumping payload
> [ 1774.965143] Unable to locate ITT: 0x0b911600 on CID: 1, dumping payload
> [ 1774.966932] Unable to locate ITT: 0x0c911600 on CID: 1, dumping payload
> [ 1774.968395] Unable to locate ITT: 0x0d911600 on CID: 1, dumping payload
> [ 1864.857610] Unable to locate ITT: 0x54951600 on CID: 1, dumping payload
> [ 1864.859108] Unable to locate ITT: 0x55951600 on CID: 1, dumping payload
> [ 1868.897251] Unable to locate ITT: 0x73951600 on CID: 1, dumping payload
> [ 1872.249432] Unable to locate ITT: 0x90951600 on CID: 1, dumping payload
> [ 1872.249436] iscsi_trx: page allocation failure: order:3, mode:0x208c020
> [ 1872.249439] CPU: 0 PID: 5031 Comm: iscsi_trx Not tainted 4.5.2-gentoo #2
> [ 1872.249440] Hardware name: Dell Inc. PowerEdge T20/0VD5HY, BIOS A01 08/13/2013

<SNIP>

> [ 1872.249556] Unable to allocate 32768 bytes for offload buffer.
> [ 1872.249559] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249560] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249561] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249563] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249564] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249565] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249566] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249567] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249568] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249569] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249570] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249571] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249573] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249574] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249576] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249577] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249578] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249579] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249583] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249584] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249585] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249586] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249587] NopOUT Flag's, Left Most Bit not set, protocol error.
> [ 1872.249588] Got unknown iSCSI OpCode: 0x5b
> [ 1872.249590] Unable to recover from unknown opcode while OFMarker=No, closing iSCSI connection.
> [ 1903.106918] Unable to locate ITT: 0xea961600 on CID: 1, dumping payload
> [ 1903.106974] Unable to locate ITT: 0xeb961600 on CID: 1, dumping payload
> [ 1903.107017] Unable to locate ITT: 0xec961600 on CID: 1, dumping payload
> [ 1903.107054] Unable to locate ITT: 0xed961600 on CID: 1, dumping payload
> [ 1915.539126] Unable to locate ITT: 0x90971600 on CID: 1, dumping payload
> [ 1915.539132] iscsi_trx: page allocation failure: order:3, mode:0x208c020
> [ 1915.539135] CPU: 0 PID: 5284 Comm: iscsi_trx Not tainted 4.5.2-gentoo #2
> [ 1915.539136] Hardware name: Dell Inc. PowerEdge T20/0VD5HY, BIOS A01 08/13/2013

<SNIP>

> [ 1915.539269] Unable to allocate 32768 bytes for offload buffer.

It's strange that smallish order:3 memory allocations begin to fail this
early..

> [ 1915.539271] Got unknown iSCSI OpCode: 0x33
> [ 1915.539272] Unable to recover from unknown opcode while OFMarker=No, closing iSCSI connection.
> [ 1915.547337] BUG: unable to handle kernel paging request at ffffc900017e4100
> [ 1915.547478] IP: [<ffffffffa03cfbf4>] iscsit_free_connection_recovery_entires+0x1d9/0x278 [iscsi_target_mod]
> [ 1915.547592] PGD 21608a067 PUD 21608b067 PMD d9520067 PTE 0
> [ 1915.547805] Oops: 0000 [#1] SMP
> [ 1915.547933] Modules linked in: tcm_loop iscsi_target_mod
> target_core_pscsi target_core_file target_core_iblock target_core_mod
> kvm_intel kvm irqbypass crc32c_intel e1000e [last unloaded:
> target_core_mod]
> [ 1915.548539] CPU: 1 PID: 4150 Comm: kworker/1:1 Not tainted 4.5.2-gentoo #2
> [ 1915.548597] Hardware name: Dell Inc. PowerEdge T20/0VD5HY, BIOS A01 08/13/2013
> [ 1915.548662] Workqueue: events iscsi_target_do_login_rx [iscsi_target_mod]
> [ 1915.548759] task: ffff8801a6d1bf00 ti: ffff8801d95d4000 task.ti: ffff8801d95d4000
> [ 1915.548817] RIP: 0010:[<ffffffffa03cfbf4>]  [<ffffffffa03cfbf4>] iscsit_free_connection_recovery_entires+0x1d9/0x278 [iscsi_target_mod]
> [ 1915.548933] RSP: 0018:ffff8801d95d7c98  EFLAGS: 00010246
> [ 1915.548987] RAX: ffff8800b68ad858 RBX: ffff8801c96e3000 RCX: ffffc900017e4100
> [ 1915.549043] RDX: 0000000000000001 RSI: 00000001800c000b RDI: ffff8800b68ad868
> [ 1915.549100] RBP: ffff8801c96e3070 R08: 0000000000000001 R09: ffffffffa03db052
> [ 1915.549197] R10: ffffea0000b54000 R11: 0000000000100001 R12: ffff8801c96e3110
> [ 1915.549294] R13: ffff8800b68ad868 R14: ffff8800b68ad840 R15: ffffc900017e3f00
> [ 1915.549392] FS:  0000000000000000(0000) GS:ffff88021eb00000(0000) knlGS:0000000000000000
> [ 1915.549531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1915.549625] CR2: ffffc900017e4100 CR3: 00000000c0ad1000 CR4: 00000000000406a0
> [ 1915.549721] Stack:
> [ 1915.549806]  ffff8800b68ad858 ffff8801c96e30f8 ffff880211a96000 ffff8801c96e3000
> [ 1915.550103]  ffff880061b4f050 ffff880061b4f000 ffff880061b4f050 ffff880214ec99c0
> [ 1915.550398]  ffff880211a96000 ffffffffa03db06a ffffffffa038845c ffff88002d500a74
> [ 1915.550695] Call Trace:
> [ 1915.550788]  [<ffffffffa03db06a>] ? iscsit_close_session+0xe5/0x164 [iscsi_target_mod]
> [ 1915.550936]  [<ffffffffa038845c>] ? atomic_dec_mb+0x4/0x4 [target_core_mod]
> [ 1915.551039]  [<ffffffffa0388bf5>] ? kref_put+0x2e/0x36 [target_core_mod]
> [ 1915.551142]  [<ffffffffa03d0676>] ? iscsi_check_for_session_reinstatement+0x1bf/0x1d0 [iscsi_target_mod]
> [ 1915.551290]  [<ffffffffa03d24d4>] ? iscsi_target_do_login+0x2f9/0x4e8 [iscsi_target_mod]
> [ 1915.551434]  [<ffffffffa03d315f>] ? iscsi_target_do_login_rx+0x165/0x1e9 [iscsi_target_mod]
> [ 1915.551578]  [<ffffffffa03d1e67>] ? iscsi_target_restore_sock_callbacks+0x8a/0x8a [iscsi_target_mod]
> [ 1915.551722]  [<ffffffff8104c608>] ? process_one_work+0x19c/0x2b9
> [ 1915.551818]  [<ffffffff8104ca5d>] ? worker_thread+0x1d9/0x2ad
> [ 1915.551912]  [<ffffffff8104c884>] ? cancel_delayed_work_sync+0xa/0xa
> [ 1915.552010]  [<ffffffff810507b7>] ? kthread+0x95/0x9d
> [ 1915.552104]  [<ffffffff81050722>] ? kthread_parkme+0x16/0x16
> [ 1915.552204]  [<ffffffff81538d5f>] ? ret_from_fork+0x3f/0x70
> [ 1915.552293]  [<ffffffff81050722>] ? kthread_parkme+0x16/0x16
> [ 1915.552382] Code: 86 90 00 00 00 c6 83 10 01 00 00 00 4d 8d 6e 28
> 4c 89 ef e8 2e 8c 16 e1 49 8b 4e 18 49 8d 46 18 48 89 04 24 4c 8d b9
> 00 fe ff ff <48> 8b 09 48 81 e9 00 02 00 00 49 8d bf 00 02 00 00 48 3b
> 3c 24
> [ 1915.555105] RIP  [<ffffffffa03cfbf4>] iscsit_free_connection_recovery_entires+0x1d9/0x278 [iscsi_target_mod]
> [ 1915.555283]  RSP <ffff8801d95d7c98>
> [ 1915.555367] CR2: ffffc900017e4100
> [ 1915.555451] ---[ end trace 405a6f5266a8bf99 ]---
> 

Ok, so after dumping repeated TCP payloads and two memory allocation
failures, iscsi login eventually hits a kernel paging OOPs after
subsequent ERL=2 connection login failure occurs.

I still need to ponder the scenario for a proper bug-fix might work, but
the issue itself is ERL=2 specific and does not effect ERL=0 operation.

So as a work-around, go ahead and change to ERL=0 in your lio_start.sh
for both IQN+TargetPortalGroupTag endpoints.  Note you'll need to force
reconnect all MSFT sessions after changing this configfs attribute for
the new values to take effect.

> 
> Thanks for the help,
> --
> Edoardo Liverani
> --

Thanks for reporting!

--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux