On Wed, 2014-05-21 at 13:01 +0300, Charalampos Pournaris wrote:
> Hi Thomas,
>
> On Tue, May 20, 2014 at 10:44 PM, Thomas Glanzmann <thomas@xxxxxxxxxxxx> wrote:
> > Hello Harry,
> >
> >> http://pastebin.com/AqqJaYVX
> >
> > I checked my log from 11th October last year when this happened to me,
> > and it looks like the same error we're hitting:
> >
> > ...
> > Oct 11 11:53:56 node-62 kernel: [219465.151250] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5488
> > Oct 11 11:53:56 node-62 kernel: [219465.151261] ABORT_TASK: Found referenced iSCSI task_tag: 5494
> > Oct 11 11:53:56 node-62 kernel: [219465.151264] ABORT_TASK: ref_tag: 5494 already complete, skipping
> > Oct 11 11:53:56 node-62 kernel: [219465.151267] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5494
> > Oct 11 11:53:56 node-62 kernel: [219465.151271] ABORT_TASK: Found referenced iSCSI task_tag: 5495
> > Oct 11 11:53:56 node-62 kernel: [219465.151273] ABORT_TASK: ref_tag: 5495 already complete, skipping
> > Oct 11 11:53:56 node-62 kernel: [219465.151275] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5495
> > Oct 11 11:54:09 node-62 kernel: [219478.744212] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000008
> > Oct 11 11:54:09 node-62 kernel: [219478.751738] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5508
> > Oct 11 11:54:23 node-62 kernel: [219492.351282] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000013
> > Oct 11 11:54:23 node-62 kernel: [219492.358819] ABORT_TASK: Sending TMR_TASK_DOES_NOT_EXIST for ref_tag: 5514
> > Oct 11 11:54:23 node-62 kernel: [219492.630489] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x0000001d
> > ...
> >
> > Full log here: https://thomas.glanzmann.de/crash/extracted_crash_dmesg.log
> >
> > I'll try to reproduce it and let the list know when I find something.
> > Harry, can you let me know how many VMs you had running, whether they
> > were thin or thick provisioned, and whether they were idle or under a
> > heavy workload?
> >
> > In my case I did _not_ use jumbo frames, but 802.3ad bonding. My
> > switches are configured to be able to deliver jumbo frames.
> >
> > Cheers,
> > Thomas
>
> Indeed, it seems that we hit the same issue, as the log lines look
> pretty similar.

One other thing that has historically caused ESX issues is having too
many LUNs on an individual iSCSI-Target TargetName+TargetPortalGroupTag
endpoint. This typically manifests as false-positive I/O timeouts (on
the ESX host side) when multiple LUNs are exported on a single endpoint
over 1 Gb/sec Ethernet (i.e., all accessed over a single session).
Usually, spreading the LUNs across multiple endpoints (i.e., multiple
IQNs) reduces or completely eliminates the ESX host-side issues.
Typically you want to keep it to around 4-5 LUNs per endpoint on
1 Gb/sec ports.

> Our setup was comprised of around 7-10 VMs powered on with some
> activity (not too intense), and if I recall correctly some VM
> deployments (through OVF/OVA) had been made prior to the failure.
> Obviously, there are no discrete steps to reproduce the problem... I
> hope this can help reproduce it in your environment, as it's difficult
> for me to make changes in our production one (e.g. to recompile the
> kernel with debugging enabled).

FYI, as long as CONFIG_DYNAMIC_DEBUG is compiled in, the debug output
for the target core + associated fabric drivers can be turned on/off on
the fly via debugfs.
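
For example, a minimal sketch, assuming the target core and iSCSI
fabric are built as modules under their mainline names, and debugfs is
mounted at the usual /sys/kernel/debug:

  # Enable all pr_debug() output in target core and the iSCSI fabric
  echo 'module target_core_mod +p' > /sys/kernel/debug/dynamic_debug/control
  echo 'module iscsi_target_mod +p' > /sys/kernel/debug/dynamic_debug/control

  # ...and switch it back off once the problem has been captured
  echo 'module target_core_mod -p' > /sys/kernel/debug/dynamic_debug/control
  echo 'module iscsi_target_mod -p' > /sys/kernel/debug/dynamic_debug/control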
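
And to illustrate the multiple-endpoint layout mentioned above, here's
a rough targetcli sketch (the IQNs and backstore names are made up for
the example, and assume the block backstores already exist):

  # Create two separate TargetName endpoints instead of a single one
  targetcli /iscsi create iqn.2003-01.org.linux-iscsi.node-62:ep1
  targetcli /iscsi create iqn.2003-01.org.linux-iscsi.node-62:ep2

  # Spread the LUNs out, ~4-5 per endpoint, instead of stacking them
  # all under one IQN
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep1/tpg1/luns create /backstores/block/vol1
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep1/tpg1/luns create /backstores/block/vol2
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep2/tpg1/luns create /backstores/block/vol3
  targetcli /iscsi/iqn.2003-01.org.linux-iscsi.node-62:ep2/tpg1/luns create /backstores/block/vol4

ESX will then open a session per endpoint, so I/O to the LUNs no longer
all queues behind a single session.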
--nab