LIO crashing Fedora box, multiple versions and kernels tested

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



I have now built out several servers in an attempt to use LIO for my
lab, but all recent attempts have met failure.  I used a very similar
setup in the past with a different lab so I know this all works, but
that build used Ubuntu and an older kernel (Ubuntu 11.04).  Here is a
brief description of my test environment:

Storage servers: IBM bladecenter using HS21 (8853) blades
Disk backend: 10k SAS JBOD, RAID 1, 5 and 6 attempted
Disk controllers: On-board LSI and Serveraid 8k (adaptec) tested
Storage fabric: QLA2462 equivalent (IBM branded) 4gb FC cards (PCI-X based)
OS: Fedora 19, 20 and 21 using multiple kernels, the latest being Kernel 3.19.3

I'm able to get things up and running, but as soon as the ESXi hosts
start using the shared storage, the server blade crashes (it actually
seems to lock up).  I've also tried multiple versions of ESXi, just in
case.  To ensure the problem wasn't with the server I ran multiple
loops of bonnie++, used Linux "stress", and ran memtest+ on the server
for over 24 hours, with no failures.  I found reports of similar
problems in the past and it was suggested that the problem was with
the performance of the backend storage.  As unlikely as I expect this
to be with my storage, here are the results of bonnie++ on my RAID 6
array, hopefully one of you are better at interpreting them than I am.

[root@labsan1 bonnie]# bonnie++ -d /bonnie -r 4096 -s 32G -n 0 -m TEST
-f -b -n 128 -u root
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
TEST            32G           336366  47 80745  12           352922
28 542.4   8
Latency                         348ms     297ms             42853us     224ms
Version  1.96       ------Sequential Create------ --------Random Create--------
TEST                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                128  1786   9 +++++ +++  1033   5  1769   9 +++++ +++   984   5
Latency               286ms     478us     448ms     282ms      32us     597ms

I've seen quite a few different errors, but the following are some of
the most common, but please let me know what I should do to collect
the best possible logs for troubleshooting the problem.

------------------------------------------------------------------------------------------SNIP--------------------------------------------------------------------------------------------------------------
Message from syslogd@labsan1 at Feb  5 23:49:01 ...
 kernel:[17184.127000] NMI watchdog: BUG: soft lockup - CPU#3 stuck
for 22s! [systemd-udevd:1602]
------------------------------------------------------------------------------------------/SNIP--------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------SNIP--------------------------------------------------------------------------------------------------------------
Apr  3 00:24:34 labsan1 kernel: rport-5:0-9: blocked FC remote port
time out: removing rport
Apr  3 00:24:34 labsan1 kernel: rport-4:0-9: blocked FC remote port
time out: removing rport
Apr  3 00:24:34 labsan1 kernel: rport-6:0-9: blocked FC remote port
time out: removing rport
Apr  3 00:24:34 labsan1 kernel: rport-3:0-9: blocked FC remote port
time out: removing rport
Apr  3 00:26:40 labsan1 kernel: [41095.711611] MODE SENSE:
unimplemented page/subpage: 0x1c/0x02
Apr  3 00:26:40 labsan1 kernel: MODE SENSE: unimplemented
page/subpage: 0x1c/0x02
Apr  3 00:34:03 labsan1 kernel: [41538.771168] Detected MISCOMPARE for
addr: ffff8800caeb7000 buf: ffff8800c9896e00
Apr  3 00:34:03 labsan1 kernel: [41538.771173] Target/iblock: Send
MISCOMPARE check condition and sense
Apr  3 00:34:03 labsan1 kernel: Detected MISCOMPARE for addr:
ffff8800caeb7000 buf: ffff8800c9896e00
Apr  3 00:34:03 labsan1 kernel: Target/iblock: Send MISCOMPARE check
condition and sense
Apr  3 00:34:48 labsan1 kernel: [41584.159441] ABORT_TASK: Found
referenced qla2xxx task_tag: 1170576
Apr  3 00:34:48 labsan1 kernel: [41584.159446] ABORT_TASK: ref_tag:
1170576 already complete, skipping
Apr  3 00:34:48 labsan1 kernel: [41584.159448] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170576
Apr  3 00:34:48 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx
task_tag: 1170576
Apr  3 00:34:48 labsan1 kernel: ABORT_TASK: ref_tag: 1170576 already
complete, skipping
Apr  3 00:34:48 labsan1 kernel: ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1170576
Apr  3 00:34:49 labsan1 kernel: [41585.156976] ABORT_TASK: Found
referenced qla2xxx task_tag: 1171236
Apr  3 00:34:49 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx
task_tag: 1171236
Apr  3 00:34:50 labsan1 kernel: [41586.351683] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1171236
Apr  3 00:34:50 labsan1 kernel: [41586.351691] ABORT_TASK: Found
referenced qla2xxx task_tag: 1172644
Apr  3 00:34:50 labsan1 kernel: ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1171236
Apr  3 00:34:50 labsan1 kernel: ABORT_TASK: Found referenced qla2xxx
task_tag: 1172644
Apr  3 00:34:50 labsan1 kernel: [41586.472423] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1172644
Apr  3 00:34:50 labsan1 kernel: [41586.472432] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1147388
Apr  3 00:34:50 labsan1 kernel: [41586.472438] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171368
Apr  3 00:34:50 labsan1 kernel: [41586.472441] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171456
Apr  3 00:34:50 labsan1 kernel: [41586.472445] ABORT_TASK: Sending
TMR_TASK_DOES_NOT_EXIST for ref_tag: 1171500
------------------------------------------------------------------------------------------/SNIP-------------------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------SNIP--------------------------------------------------------------------------------------------------------------
Apr  3 02:04:40 labsan1 kernel: [ 2282.079013]  rport-6:0-5: blocked
FC remote port time out: no longer a FCP target, removing starget
Apr  3 02:04:40 labsan1 kernel: rport-6:0-5: blocked FC remote port
time out: no longer a FCP target, removing starget
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] ------------[ cut here
]------------
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] WARNING: CPU: 3 PID: 0
at kernel/watchdog.c:317 watchdog_overflow_callback+0x92/0xc0()
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] Watchdog detected hard
LOCKUP on cpu 3
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] Modules linked in:
tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
 target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter
ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute
 bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security i
p6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip
table_security iptable_raw coretemp kvm_intel kvm iTCO_wdt
iTCO_vendor_support gpio_ich ipmi_ssif ipmi_devintf lpc_ich mfd_core
i5000
_edac serio_raw edac_core ses ioatdma enclosure i5k_amb shpchp dca
ipmi_si acpi_cpufreq ipmi_msghandler nfsd auth_rpcgss nfs_acl lock
d grace sunrpc radeon i2c_algo_bit drm_kms_helper ttm drm qla2xxx bnx2
ata_generic pata_acpi scsi_transport_fc aacraid
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] CPU: 3 PID: 0 Comm:
swapper/3 Not tainted 3.19.3-200.fc21.x86_64 #1
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] Hardware name: IBM IBM
eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS -[BCE14
8BUS-1.21]- 04/04/2011
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  0000000000000000
ef4cea677ede47e6 ffff88012fd85a60 ffffffff8176e215
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  0000000000000000
ffff88012fd85ab8 ffff88012fd85aa0 ffffffff8109bc1a
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  0000000000000000
ffff88012a510000 0000000000000000 ffff88012fd85c00
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] Call Trace:
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  <NMI>
[<ffffffff8176e215>] dump_stack+0x45/0x57
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8109bc1a>]
warn_slowpath_common+0x8a/0xc0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8109bca5>]
warn_slowpath_fmt+0x55/0x70
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff81150ad2>]
watchdog_overflow_callback+0x92/0xc0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8119191b>]
__perf_event_overflow+0x9b/0x250
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff81192434>]
perf_event_overflow+0x14/0x20
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8103460a>]
intel_pmu_handle_irq+0x1da/0x3f0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8102bafb>]
perf_event_nmi_handler+0x2b/0x50
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff81018fd8>]
nmi_handle+0x88/0x130
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff81019562>]
default_do_nmi+0x42/0x110
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810196b8>]
do_nmi+0x88/0xd0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff81776d21>]
end_repeat_nmi+0x1e/0x2e
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8177448a>] ?
_raw_spin_lock_irqsave+0x4a/0x60
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8177448a>] ?
_raw_spin_lock_irqsave+0x4a/0x60
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8177448a>] ?
_raw_spin_lock_irqsave+0x4a/0x60
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  <<EOE>>  <IRQ>
[<ffffffffa00eade2>] qlt_fc_port_deleted+0x62/0xd0 [qla2xxx]
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffffa008cc13>]
qla2x00_mark_device_lost+0x153/0x2e0 [qla2xxx]
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffffa00ac8a9>]
qla2x00_async_event+0xe39/0x1890 [qla2xxx]
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810cb828>] ?
sched_clock_cpu+0x88/0xb0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810c04ca>] ?
update_rq_clock.part.78+0x1a/0xe0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810cf9c3>] ?
update_blocked_averages+0x2f3/0x7a0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffffa00ade41>]
qla24xx_intr_handler+0x1a1/0x2f0 [qla2xxx]
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810f2a47>]
handle_irq_event_percpu+0x77/0x1a0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810f2bab>]
handle_irq_event+0x3b/0x60
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8101e8ca>] ?
native_sched_clock+0x2a/0x90
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810f5a99>]
handle_fasteoi_irq+0x79/0x120
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff81017414>]
handle_irq+0x74/0x140
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810bb54a>] ?
atomic_notifier_call_chain+0x1a/0x20
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8177777f>]
do_IRQ+0x4f/0xf0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8177556d>]
common_interrupt+0x6d/0x6d
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  <EOI>
[<ffffffff81103c98>] ? hrtimer_start+0x18/0x20
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8105ea56>] ?
native_safe_halt+0x6/0x10
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810fabb3>] ?
rcu_eqs_enter+0xa3/0xb0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8101f97e>]
default_idle+0x1e/0xc0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8102034f>]
arch_cpu_idle+0xf/0x20
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff810de15a>]
cpu_startup_entry+0x37a/0x3c0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715]  [<ffffffff8104af3a>]
start_secondary+0x1aa/0x1f0
Apr  3 02:05:10 labsan1 kernel: [ 2252.121715] ---[ end trace
05113eaf8d7e2e9d ]---
Apr  3 02:05:10 labsan1 kernel: [ 2312.441004] INFO: rcu_sched
detected stalls on CPUs/tasks: { 3} (detected by 0, t=60003 jiffies,
g=3933, c=3932, q=0)
Apr  3 02:05:10 labsan1 kernel: [ 2312.441207] Task dump for CPU 3:
Apr  3 02:05:10 labsan1 kernel: [ 2312.441211] swapper/3       R
running task        0     0      1 0x00000008
Apr  3 02:05:10 labsan1 kernel: [ 2312.441215]  0000000000000000
0000000000000000 ffffffffffffff1e ffffffff8105ea56
Apr  3 02:05:10 labsan1 kernel: [ 2312.441219]  0000000000000010
0000000000000246 ffff88012abf7e88 0000000000000018
Apr  3 02:05:10 labsan1 kernel: [ 2312.441222]  ffffffff810fabb3
ffff88012abf7ea8 ffffffff8101f97e ffffffff81d2a6c0
Apr  3 02:05:10 labsan1 kernel: [ 2312.441226] Call Trace:
Apr  3 02:05:10 labsan1 kernel: [ 2312.441237]  [<ffffffff8105ea56>] ?
native_safe_halt+0x6/0x10
Apr  3 02:05:10 labsan1 kernel: [ 2312.441242]  [<ffffffff810fabb3>] ?
rcu_eqs_enter+0xa3/0xb0
Apr  3 02:05:10 labsan1 kernel: [ 2312.441247]  [<ffffffff8101f97e>] ?
default_idle+0x1e/0xc0
Apr  3 02:05:10 labsan1 kernel: [ 2312.441251]  [<ffffffff8102034f>] ?
arch_cpu_idle+0xf/0x20
Apr  3 02:05:10 labsan1 kernel: [ 2312.441254]  [<ffffffff810de15a>] ?
cpu_startup_entry+0x37a/0x3c0
Apr  3 02:05:10 labsan1 kernel: [ 2312.441259]  [<ffffffff8104af3a>] ?
start_secondary+0x1aa/0x1f0
Apr  3 02:05:10 labsan1 kernel: ------------[ cut here ]------------
Apr  3 02:05:10 labsan1 kernel: WARNING: CPU: 3 PID: 0 at
kernel/watchdog.c:317 watchdog_overflow_callback+0x92/0xc0()
Apr  3 02:05:10 labsan1 kernel: Watchdog detected hard LOCKUP on cpu 3
Apr  3 02:05:10 labsan1 kernel: Modules linked in:
Apr  3 02:05:16 labsan1 kernel: [ 2318.111331] ABORT_TASK: Found
referenced qla2xxx task_tag: 1192752
Apr  3 02:05:30 labsan1 kernel: [ 2332.303075] ABORT_TASK: Sending
TMR_FUNCTION_COMPLETE for ref_tag: 1192752
Apr  3 02:05:30 labsan1 kernel: [ 2332.303084] ABORT_TASK: Found
referenced qla2xxx task_tag: 1199704
------------------------------------------------------------------------------------------/SNIP--------------------------------------------------------------------------------------------------------------

Thanks,
Dan
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux