On Fri, 2012-05-18 at 18:26 -0700, Nicholas A. Bellinger wrote:
> On Fri, 2012-05-18 at 18:49 -0400, Jörn Engel wrote:
> > On Thu, 17 May 2012 23:09:27 -0700, Nicholas A. Bellinger wrote:
> > >
> > > So that means you're testing with it now, right..?
> >
> > Our kernel has diverged too much from yours to easily move patches
> > back and forth, so no.
> >
> > But more importantly, we still don't have a complete patchset.
> >
>
> Ok, I've pushed into lio-core the following WIP patches based on your
> original patch + s_id + loop_id clearing patch from today:
>
> 6fc162d3 tcm_qla2xxx: Clear session s_id + loop_id earlier during shutdown
> dfebe3b5 tcm_qla2xxx: Convert to TFO->put_session() usage
> ec7cf009 target: Add TFO->put_session() caller for HW fabric session shutdown
>
> Today I've been testing these changes against typical active I/O during
> tcm_qla2xxx endpoint shutdown, and against the explicit NodeACL +
> MappedLUNs removal case.  So far, with active FCP on the order of ~100K
> IOPs of random mixed-mode 4K blocks, these patches are performing session
> shutdown and unloading tcm_qla2xxx references as expected.
>
> Certainly these need more testing wrt a number of special cases for
> active I/O shutdown, but I think they look reasonable enough to put into
> lio-core for now..
>
> Please have a look and let me know if you have any problems getting it
> applied for testing into your .39 tree.
>

Quick update here folks,

So these three patches have been running over the weekend with active I/O
shutdown, using the same ~100K IOPs fio randrw mixed load plus explicit
NodeACL + MappedLUN=0 removal + re-add via qla2xxx rtslib scripts.

Across 5K test iterations with FC clients pushing fio randrw traffic, there
have so far been no OOPsen or outstanding-I/O shutdown hangs during the
explicit NodeACL + MappedLUN=0 removal ops..

After the test completed and tcm_qla2xxx was released, the following
qla_tgt_cmd_cachep leakage warnings appeared while unloading the qla2xxx LLD:

[244560.843319] qla2xxx [0000:07:00.1]-4801:15: DPC handler waking up.
[244560.850509] qla2xxx [0000:07:00.1]-4802:15: dpc_flags=0x201250.
[244560.858036] qla2xxx [0000:07:00.1]-0121:15: Failed to enable receiving of RSCN requests: 0x2.
[244560.867640] qla2xxx [0000:07:00.1]-480f:15: Loop resync scheduled.
[244560.875896] qla2xxx [0000:07:00.1]-8837:15: F/W Ready - OK.
[244560.882222] qla2xxx [0000:07:00.1]-883a:15: fw_state=3 (3, 63eb, 2, 0) curr time=103a47b53.
[244560.894437] qla2xxx [0000:07:00.1]-4810:15: Loop resync end.
[244560.900844] qla2xxx [0000:07:00.1]-4800:15: DPC handler sleeping.
[244561.031733] qla2xxx [0000:07:00.1]-e802:15: tgt ffff8802646bc800, empty(sess_list)=1 sess_count=0
[244561.063875] qla2xxx [0000:07:00.1]-f80b:15: Waiting for 0 IRQ commands to complete (tgt ffff8802646bc800)
[244561.074469] qla2xxx [0000:07:00.1]-f80c:15: Stop of tgt ffff8802646bc800 finished
[244561.099207] sd 45:0:1:0: alua: Detached
[244561.110278] sd 44:0:1:0: alua: Detached
[244576.123716] =============================================================================
[244576.132936] BUG qla_tgt_cmd_cachep (Tainted: G O): Objects remaining on kmem_cache_close()
[244576.143214] -----------------------------------------------------------------------------
[244576.143215]
[244576.154181] INFO: Slab 0xffffea000420ee00 objects=25 used=1 fp=0xffff8801083bf740 flags=0x8000000000004080
[244576.165042] Pid: 22707, comm: rmmod Tainted: G O 3.4.0-rc2+ #65
[244576.172706] Call Trace:
[244576.175525]  [<ffffffff810c1e07>] slab_err+0x90/0x9e
[244576.181159]  [<ffffffff8105d901>] ? trace_hardirqs_on+0xd/0xf
[244576.187667]  [<ffffffff810abc10>] ? free_percpu+0x2c/0x112
[244576.193879]  [<ffffffff810c6403>] kmem_cache_destroy+0x152/0x309
[244576.200674]  [<ffffffff81099ae1>] ? mempool_destroy+0x43/0x47
[244576.207185]  [<ffffffffa03a3782>] qlt_exit+0x3d/0x3f [qla2xxx]
[244576.213790]  [<ffffffffa03aa8d5>] qla2x00_module_exit+0x79/0xa6 [qla2xxx]
[244576.221456]  [<ffffffff8106a4bb>] sys_delete_module+0x1fb/0x25f
[244576.228154]  [<ffffffff811a1404>] ? lockdep_sys_exit_thunk+0x35/0x67
[244576.235337]  [<ffffffff81377279>] system_call_fastpath+0x16/0x1b
[244576.242134] INFO: Object 0xffff8801083ba2c8 @offset=8904
[244576.248150] =============================================================================

So I think we might (finally) have the target_wait_for_sess_cmds() hang
addressed for tcm_qla2xxx with explicit NodeACL shutdown, but we are still
leaking descriptor memory in some qla_target.c exception path..

I'm now trying to narrow down the workload needed to reproduce this leakage,
and to figure out why these descriptors are not being included in the
->sess_wait_list used by target_wait_for_sess_cmds().  A rough sketch of the
accounting pattern in question follows below.

--nab
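For reference, here is a rough sketch of the per-session command accounting
mentioned above.  This is only an illustration, not the actual drivers/target
or qla_target.c code; all of the demo_* names are invented.  The point is
simply that a descriptor allocated from the cmd cache which never gets linked
onto the session list is invisible to a wait_for_sess_cmds()-style drain, and
if it is also never freed it is still allocated when the cache is destroyed at
rmmod time, which is when the slab code prints the "Objects remaining on
kmem_cache_close()" warning seen above.

/*
 * Rough sketch only -- not the real drivers/target or qla_target.c code.
 * All demo_* names are made up for illustration.
 */
#include <linux/types.h>
#include <linux/kref.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/completion.h>
#include <linux/slab.h>

struct demo_sess {
	spinlock_t		cmd_lock;
	struct list_head	cmd_list;	/* roughly the per-session wait list */
	struct completion	cmds_done;
};

struct demo_cmd {
	struct kref		kref;
	struct list_head	cmd_entry;
	struct demo_sess	*sess;
};

static struct kmem_cache *demo_cmd_cachep;	/* stand-in for qla_tgt_cmd_cachep */

static void demo_sess_init(struct demo_sess *sess)
{
	spin_lock_init(&sess->cmd_lock);
	INIT_LIST_HEAD(&sess->cmd_list);
	init_completion(&sess->cmds_done);
}

/*
 * Every descriptor must be linked onto the session list at allocation time;
 * an exception path that allocates but skips the list_add_tail() produces
 * exactly the kind of object that survives until module unload.
 */
static struct demo_cmd *demo_get_cmd(struct demo_sess *sess)
{
	struct demo_cmd *cmd;
	unsigned long flags;

	cmd = kmem_cache_zalloc(demo_cmd_cachep, GFP_ATOMIC);
	if (!cmd)
		return NULL;

	kref_init(&cmd->kref);
	cmd->sess = sess;

	spin_lock_irqsave(&sess->cmd_lock, flags);
	list_add_tail(&cmd->cmd_entry, &sess->cmd_list);
	spin_unlock_irqrestore(&sess->cmd_lock, flags);
	return cmd;
}

/* The final kref_put() unlinks the descriptor and wakes up session shutdown. */
static void demo_cmd_release(struct kref *kref)
{
	struct demo_cmd *cmd = container_of(kref, struct demo_cmd, kref);
	struct demo_sess *sess = cmd->sess;
	unsigned long flags;
	bool empty;

	spin_lock_irqsave(&sess->cmd_lock, flags);
	list_del(&cmd->cmd_entry);
	empty = list_empty(&sess->cmd_list);
	spin_unlock_irqrestore(&sess->cmd_lock, flags);

	kmem_cache_free(demo_cmd_cachep, cmd);
	if (empty)
		complete(&sess->cmds_done);
}

static void demo_put_cmd(struct demo_cmd *cmd)
{
	kref_put(&cmd->kref, demo_cmd_release);
}

/* Session shutdown just waits for the per-session list to drain. */
static void demo_wait_for_sess_cmds(struct demo_sess *sess)
{
	unsigned long flags;
	bool empty;

	spin_lock_irqsave(&sess->cmd_lock, flags);
	empty = list_empty(&sess->cmd_list);
	spin_unlock_irqrestore(&sess->cmd_lock, flags);

	if (!empty)
		wait_for_completion(&sess->cmds_done);
}

In this scheme any qla_tgt_cmd allocated in an error/exception path that never
goes through the equivalent of demo_get_cmd()/demo_put_cmd() is both skipped
by the shutdown drain and reported later by kmem_cache_destroy(), matching the
symptoms in the log above.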