Re: LUNs become unavailable with current git HEAD

Hello Nab,

> Oct 11 12:14:40 node-62 kernel: [220709.511271] ABORT_TASK: Found referenced iSCSI task_tag: 7795
> Oct 11 12:14:40 node-62 kernel: [220709.511275] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 7795
> Oct 11 12:14:40 node-62 kernel: [220709.511297] ABORT_TASK: Found referenced iSCSI task_tag: 7797
> Oct 11 12:14:40 node-62 kernel: [220709.511298] ABORT_TASK: ref_tag: 7797 already complete, skipping
> Oct 11 12:14:40 node-62 kernel: [220709.511299] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7797
> Oct 11 12:14:40 node-62 kernel: [220709.511308] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7795
> Oct 11 12:14:40 node-62 kernel: [220709.511322] ABORT_TASK: Found referenced iSCSI task_tag: 7797
> Oct 11 12:14:40 node-62 kernel: [220709.511323] ABORT_TASK: ref_tag: 7797 already complete, skipping
> Oct 11 12:14:40 node-62 kernel: [220709.511324] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 7797
> Oct 11 12:14:40 node-62 kernel: [220709.511420] ABORT_TASK: Found referenced iSCSI task_tag: 37422
> Oct 11 12:14:40 node-62 kernel: [220709.511422] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 37422
> Oct 11 12:14:40 node-62 kernel: [220709.511425] ABORT_TASK: Found referenced iSCSI task_tag: 37423
> Oct 11 12:14:40 node-62 kernel: [220709.511426] ABORT_TASK: ref_tag: 37423 already complete, skipping
> Oct 11 12:14:40 node-62 kernel: [220709.511427] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 37423
> Oct 11 12:14:40 node-62 kernel: [220709.511503] ABORT_TASK: Found referenced iSCSI task_tag: 6574
> Oct 11 12:14:40 node-62 kernel: [220709.511505] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 6574
> Oct 11 12:14:40 node-62 kernel: [220709.511619] ABORT_TASK: Found referenced iSCSI task_tag: 1593
> Oct 11 12:14:40 node-62 kernel: [220709.511623] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 1593
> Oct 11 12:14:40 node-62 kernel: [220709.516794] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000009f
> Oct 11 12:14:40 node-62 kernel: [220709.524610] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x000000a0
> Oct 11 12:14:40 node-62 kernel: [220709.524690] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000007f
> Oct 11 12:14:40 node-62 kernel: [220709.524805] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000013
> Oct 11 12:14:40 node-62 kernel: [220709.546974] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000075
> Oct 11 12:14:41 node-62 kernel: [220710.320215] kmem_cache_destroy lio_qr_cache: Slab cache still has objects
> Oct 11 12:14:41 node-62 kernel: [220710.327218] CPU: 11 PID: 10178 Comm: rmmod Tainted: G           O 3.12.0-rc3+ #1
> Oct 11 12:14:41 node-62 kernel: [220710.327222] Hardware name: Supermicro X9SRD-F/X9SRD-F, BIOS 1.0a 10/15/2012
> Oct 11 12:14:41 node-62 kernel: [220710.327226]  0000000000000000 0000000000000000 ffffffff8137d7c1 ffff8810214093c0
> Oct 11 12:14:41 node-62 kernel: [220710.327233]  ffffffff810d9a27 ffffffffa03e7e50 ffffffffa03d6d98 ffffffffa03d6d70
> Oct 11 12:14:41 node-62 kernel: [220710.327238]  ffffffff8108442e ffffffff8105946c 00000000813806ec ffffffff81645210
> Oct 11 12:14:41 node-62 kernel: [220710.327244] Call Trace:
> Oct 11 12:14:41 node-62 kernel: [220710.327256]  [<ffffffff8137d7c1>] ? dump_stack+0x41/0x51
> Oct 11 12:14:41 node-62 kernel: [220710.327282]  [<ffffffff810d9a27>] ? kmem_cache_destroy+0xcb/0xdf
> Oct 11 12:14:41 node-62 kernel: [220710.327297]  [<ffffffffa03d6d98>] ? iscsi_target_cleanup_module+0x28/0x290 [iscsi_target_mod]
> Oct 11 12:14:41 node-62 kernel: [220710.327308]  [<ffffffffa03d6d70>] ? iscsit_put_transport+0xf/0xf [iscsi_target_mod]
> Oct 11 12:14:41 node-62 kernel: [220710.327315]  [<ffffffff8108442e>] ? SyS_delete_module+0x215/0x299
> Oct 11 12:14:41 node-62 kernel: [220710.327321]  [<ffffffff8105946c>] ? should_resched+0x5/0x23
> Oct 11 12:14:41 node-62 kernel: [220710.327327]  [<ffffffff81381b32>] ? page_fault+0x22/0x30
> Oct 11 12:14:41 node-62 kernel: [220710.327333]  [<ffffffff81386722>] ? system_call_fastpath+0x16/0x1b

> So the kmem_cache_destroy item indicates that the target service was
> stopped very soon after the ABORT_TASKs occurred.

> Can you confirm that the target was stopped, and not restarted..?

I confirm this. What happened is the following: we had two LUNs in demo
mode, accessed by 8 ESX servers, and each ESX server had VMs running on
it. Each ESX server also had 3 private LUNs presented. I had also set up
another VLAN with a similar setup, but no one connected to those LUNs.
And two LUNs I used for Linux iSCSI initiator tests for Doug.

Shortly before the incident happened, we upgraded all our ESX servers
from 5.1 GA to the newest available build, always two at a time. Then I
started the course evaluation, and during it one participant told me
that his ESX server had locked up, that he had a lot of alerts, and that
everything had stalled; a second one reported the same, and then
another. So the first thing I checked was the iSCSI target, and I saw
the dmesg output above. So I typed in:

        reboot

And this is when the target was stopped. The system rebooted and came
back up, the ESX servers got their connections back, and everything
returned to normal. A few tasks in vCenter took a little while to time
out, but that is normal after such a situation.

This weekend I was slacking to recharge after last week's 16-hour days.
:-) I'll try to reproduce this issue, but I assume it will be hard,
because I do not know what triggered it, and I don't want to run another
class until I'm able to reproduce it and it is fixed. But I'll report
back as soon as I can.

If there is a hunch or any idea what could have triggered it, let me
know; otherwise I'll do the following to try to reproduce it:

        - Install 12 ESX servers
        - Do a setup similar to the lab setup.
        - Generate load by provisioning VMs, booting them up, migrating
          them.
        - Update the ESX servers.

And I hope that will trigger it, but the thing is, the target was rock
stable for four days while we punched it with 27 concurrent svMotions.
Any so-called enterprise storage would have yielded. :-)

Cheers,
        Thomas
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html



