I have now built out several servers in an attempt to use LIO for my lab, but all recent attempts have met failure. I used a very similar setup in the past with a different lab so I know this all works, but that build used Ubuntu and an older kernel (Ubuntu 11.04). Here is a brief description of my test environment: Storage servers: IBM bladecenter using HS21 (8853) blades Disk backend: 10k SAS JBOD, RAID 1, 5 and 6 attempted Disk controllers: On-board LSI and Serveraid 8k (adaptec) tested Storage fabric: QLA2462 equivalent (IBM branded) 4gb FC cards (PCI-X based) OS: Fedora 19, 20 and 21 using multiple kernels, the latest being Kernel 3.19.3 I'm able to get things up and running, but as soon as the ESXi hosts start using the shared storage, the server blade crashes (it actually seems to lock up). I've also tried multiple versions of ESXi, just in case. To ensure the problem wasn't with the server I ran multiple loops of bonnie++, used Linux "stress", and ran memtest+ on the server for over 24 hours, with no failures. I found reports of similar problems in the past and it was suggested that the problem was with the performance of the backend storage. As unlikely as I expect this to be with my storage, here are the results of bonnie++ on my RAID 6 array, hopefully one of you are better at interpreting them than I am. [root@labsan1 bonnie]# bonnie++ -d /bonnie -r 4096 -s 32G -n 0 -m TEST -f -b -n 128 -u root Version 1.96 ------Sequential Output------ --Sequential Input- --Random- Concurrency 1 -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks-- Machine Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP TEST 32G 336366 47 80745 12 352922 28 542.4 8 Latency 348ms 297ms 42853us 224ms Version 1.96 ------Sequential Create------ --------Random Create-------- TEST -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete-- files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP 128 1786 9 +++++ +++ 1033 5 1769 9 +++++ +++ 984 5 Latency 286ms 478us 448ms 282ms 32us 597ms I've seen quite a few different errors, but the following are some of the most common, but please let me know what I should do to collect the best possible logs for troubleshooting the problem. ------------------------------------------------------------------------------------------SNIP-------------------------------------------------------------------------------------------------------------- Message from syslogd@labsan1 at Feb 5 23:49:01 ... kernel:[17184.127000] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [systemd-udevd:1602] ------------------------------------------------------------------------------------------/SNIP-------------------------------------------------------------------------------------------------------------- ------------------------------------------------------------------------------------------SNIP-------------------------------------------------------------------------------------------------------------- Apr 3 00:24:34 labsan1 kernel: rport-5:0-9: blocked FC remote port time out: removing rport Apr 3 00:24:34 labsan1 kernel: rport-4:0-9: blocked FC remote port time out: removing rport Apr 3 00:24:34 labsan1 kernel: rport-6:0-9: blocked FC remote port time out: removing rport Apr 3 00:24:34 labsan1 kernel: rport-3:0-9: blocked FC remote port time out: removing rport Apr 3 00:26:40 labsan1 kernel: [41095.711611] MODE SENSE: unimplemented page/subpage: 0x1c/0x02 Apr 3 00:26:40 labsan1 kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02 Apr 3 00:34:03 labsan1 kernel: [41538.771168] Detected MISCOMPARE for addr: ffff8800caeb7000 buf: ffff8800c9896e00 Apr 3 00:34:03 labsan1 kernel: [41538.771173] Target/iblock: Send MISCOMPARE check condition and sense Apr 3 00:34:03 labsan1 kernel: Detected MISCOMPARE for addr: ffff8800caeb7000 buf: ffff8800c9896e00 Apr 3 00:34:03 labsan1 kernel: Target/iblock: Send MISCOMPARE check condition and sense Apr 3 00:34:48 labsan1 kernel: [41584.159441] ABORT_TASK: Found referenced qla2xxx task_tag: 1170576 Apr 3 00:34:48 labsan1 kernel: [41584.159446] ABORT_TASK: ref_tag: 1170576 already complete, skipping Apr 3 00:34:48 labsan1 kernel: [41584.159448] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170576 Apr 3 00:34:48 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1170576 Apr 3 00:34:48 labsan1 kernel: ABORT_TASK: ref_tag: 1170576 already complete, skipping Apr 3 00:34:48 labsan1 kernel: ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170576 Apr 3 00:34:49 labsan1 kernel: [41585.156976] ABORT_TASK: Found referenced qla2xxx task_tag: 1171236 Apr 3 00:34:49 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1171236 Apr 3 00:34:50 labsan1 kernel: [41586.351683] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1171236 Apr 3 00:34:50 labsan1 kernel: [41586.351691] ABORT_TASK: Found referenced qla2xxx task_tag: 1172644 Apr 3 00:34:50 labsan1 kernel: ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1171236 Apr 3 00:34:50 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx task_tag: 1172644 Apr 3 00:34:50 labsan1 kernel: [41586.472423] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1172644 Apr 3 00:34:50 labsan1 kernel: [41586.472432] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1147388 Apr 3 00:34:50 labsan1 kernel: [41586.472438] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171368 Apr 3 00:34:50 labsan1 kernel: [41586.472441] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171456 Apr 3 00:34:50 labsan1 kernel: [41586.472445] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171500 ------------------------------------------------------------------------------------------/SNIP------------------------------------------------------------------------------------------------------------- ------------------------------------------------------------------------------------------SNIP-------------------------------------------------------------------------------------------------------------- Apr 3 02:04:40 labsan1 kernel: [ 2282.079013] rport-6:0-5: blocked FC remote port time out: no longer a FCP target, removing starget Apr 3 02:04:40 labsan1 kernel: rport-6:0-5: blocked FC remote port time out: no longer a FCP target, removing starget Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] ------------[ cut here ]------------ Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] WARNING: CPU: 3 PID: 0 at kernel/watchdog.c:317 watchdog_overflow_callback+0x92/0xc0() Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Watchdog detected hard LOCKUP on cpu 3 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Modules linked in: tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security i p6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip table_security iptable_raw coretemp kvm_intel kvm iTCO_wdt iTCO_vendor_support gpio_ich ipmi_ssif ipmi_devintf lpc_ich mfd_core i5000 _edac serio_raw edac_core ses ioatdma enclosure i5k_amb shpchp dca ipmi_si acpi_cpufreq ipmi_msghandler nfsd auth_rpcgss nfs_acl lock d grace sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm qla2xxx bnx2 ata_generic pata_acpi scsi_transport_fc aacraid Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.19.3-200.fc21.x86_64 #1 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Hardware name: IBM IBM eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS -[BCE14 8BUS-1.21]- 04/04/2011 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] 0000000000000000 ef4cea677ede47e6 ffff88012fd85a60 ffffffff8176e215 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] 0000000000000000 ffff88012fd85ab8 ffff88012fd85aa0 ffffffff8109bc1a Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] 0000000000000000 ffff88012a510000 0000000000000000 ffff88012fd85c00 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] Call Trace: Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] <NMI> [<ffffffff8176e215>] dump_stack+0x45/0x57 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8109bc1a>] warn_slowpath_common+0x8a/0xc0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8109bca5>] warn_slowpath_fmt+0x55/0x70 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81150ad2>] watchdog_overflow_callback+0x92/0xc0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8119191b>] __perf_event_overflow+0x9b/0x250 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81192434>] perf_event_overflow+0x14/0x20 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8103460a>] intel_pmu_handle_irq+0x1da/0x3f0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8102bafb>] perf_event_nmi_handler+0x2b/0x50 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81018fd8>] nmi_handle+0x88/0x130 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81019562>] default_do_nmi+0x42/0x110 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810196b8>] do_nmi+0x88/0xd0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81776d21>] end_repeat_nmi+0x1e/0x2e Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177448a>] ? _raw_spin_lock_irqsave+0x4a/0x60 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177448a>] ? _raw_spin_lock_irqsave+0x4a/0x60 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177448a>] ? _raw_spin_lock_irqsave+0x4a/0x60 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] <<EOE>> <IRQ> [<ffffffffa00eade2>] qlt_fc_port_deleted+0x62/0xd0 [qla2xxx] Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffffa008cc13>] qla2x00_mark_device_lost+0x153/0x2e0 [qla2xxx] Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffffa00ac8a9>] qla2x00_async_event+0xe39/0x1890 [qla2xxx] Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810cb828>] ? sched_clock_cpu+0x88/0xb0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810c04ca>] ? update_rq_clock.part.78+0x1a/0xe0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810cf9c3>] ? update_blocked_averages+0x2f3/0x7a0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffffa00ade41>] qla24xx_intr_handler+0x1a1/0x2f0 [qla2xxx] Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810f2a47>] handle_irq_event_percpu+0x77/0x1a0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810f2bab>] handle_irq_event+0x3b/0x60 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8101e8ca>] ? native_sched_clock+0x2a/0x90 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810f5a99>] handle_fasteoi_irq+0x79/0x120 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff81017414>] handle_irq+0x74/0x140 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810bb54a>] ? atomic_notifier_call_chain+0x1a/0x20 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177777f>] do_IRQ+0x4f/0xf0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8177556d>] common_interrupt+0x6d/0x6d Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] <EOI> [<ffffffff81103c98>] ? hrtimer_start+0x18/0x20 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8105ea56>] ? native_safe_halt+0x6/0x10 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810fabb3>] ? rcu_eqs_enter+0xa3/0xb0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8101f97e>] default_idle+0x1e/0xc0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8102034f>] arch_cpu_idle+0xf/0x20 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff810de15a>] cpu_startup_entry+0x37a/0x3c0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] [<ffffffff8104af3a>] start_secondary+0x1aa/0x1f0 Apr 3 02:05:10 labsan1 kernel: [ 2252.121715] ---[ end trace 05113eaf8d7e2e9d ]--- Apr 3 02:05:10 labsan1 kernel: [ 2312.441004] INFO: rcu_sched detected stalls on CPUs/tasks: { 3} (detected by 0, t=60003 jiffies, g=3933, c=3932, q=0) Apr 3 02:05:10 labsan1 kernel: [ 2312.441207] Task dump for CPU 3: Apr 3 02:05:10 labsan1 kernel: [ 2312.441211] swapper/3 R running task 0 0 1 0x00000008 Apr 3 02:05:10 labsan1 kernel: [ 2312.441215] 0000000000000000 0000000000000000 ffffffffffffff1e ffffffff8105ea56 Apr 3 02:05:10 labsan1 kernel: [ 2312.441219] 0000000000000010 0000000000000246 ffff88012abf7e88 0000000000000018 Apr 3 02:05:10 labsan1 kernel: [ 2312.441222] ffffffff810fabb3 ffff88012abf7ea8 ffffffff8101f97e ffffffff81d2a6c0 Apr 3 02:05:10 labsan1 kernel: [ 2312.441226] Call Trace: Apr 3 02:05:10 labsan1 kernel: [ 2312.441237] [<ffffffff8105ea56>] ? native_safe_halt+0x6/0x10 Apr 3 02:05:10 labsan1 kernel: [ 2312.441242] [<ffffffff810fabb3>] ? rcu_eqs_enter+0xa3/0xb0 Apr 3 02:05:10 labsan1 kernel: [ 2312.441247] [<ffffffff8101f97e>] ? default_idle+0x1e/0xc0 Apr 3 02:05:10 labsan1 kernel: [ 2312.441251] [<ffffffff8102034f>] ? arch_cpu_idle+0xf/0x20 Apr 3 02:05:10 labsan1 kernel: [ 2312.441254] [<ffffffff810de15a>] ? cpu_startup_entry+0x37a/0x3c0 Apr 3 02:05:10 labsan1 kernel: [ 2312.441259] [<ffffffff8104af3a>] ? start_secondary+0x1aa/0x1f0 Apr 3 02:05:10 labsan1 kernel: ------------[ cut here ]------------ Apr 3 02:05:10 labsan1 kernel: WARNING: CPU: 3 PID: 0 at kernel/watchdog.c:317 watchdog_overflow_callback+0x92/0xc0() Apr 3 02:05:10 labsan1 kernel: Watchdog detected hard LOCKUP on cpu 3 Apr 3 02:05:10 labsan1 kernel: Modules linked in: Apr 3 02:05:16 labsan1 kernel: [ 2318.111331] ABORT_TASK: Found referenced qla2xxx task_tag: 1192752 Apr 3 02:05:30 labsan1 kernel: [ 2332.303075] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1192752 Apr 3 02:05:30 labsan1 kernel: [ 2332.303084] ABORT_TASK: Found referenced qla2xxx task_tag: 1199704 ------------------------------------------------------------------------------------------/SNIP-------------------------------------------------------------------------------------------------------------- Thanks, Dan -- To unsubscribe from this list: send the line "unsubscribe target-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html