On Wednesday, March 03/04/20, 2020 at 23:56:12 +0530, Potnuri Bharat Teja wrote:
> On Friday, February 02/28/20, 2020 at 05:12:32 +0530, Sagi Grimberg wrote:
> >
> > >>> Hi All,
> > >>> I observe connection errors almost immediately after I start iozone over iser
> > >>> LUNs. Attached are the connection error and hung task traces on the initiator and
> > >>> the target respectively.
> > >>> Interestingly, I see connection errors only if the LUN size is less than 512MB.
> > >>> In my case I could consistently reproduce the issue with 511MB and 300MB
> > >>> LUN sizes. Connection errors are not seen if I create a 512MB or greater LUN.
> > >>
> > >> Can you share the log output on the target leading up to the hung tasks?
> > >
> > > Sure, attached are the target and initiator dmesg logs.
> > >>
> > >>> Further, after the connection errors, I noticed that the poll workqueue is
> > >>> stuck and never processes the drain CQE, resulting in hung tasks on the target side.
> > >>
> > >> Is the drain CQE actually generated?
> > >>
> > >
> > > Yes, it is generated. I was able to track it with prints until queue_work() in
> > > ib_cq_completion_workqueue(). The work function ib_cq_poll_work() never gets
> > > scheduled. Therefore, I see the drain CQE left unpolled and a hung task because
> > > __ib_drain_sq() waits forever for complete() to be called from the drain CQE's
> > > done() handler.
> >
> > Hmm, that is interesting. This tells me that cq->work is probably
> > blocked by another completion invocation (which hangs), which means that
> > queuing the cq->work did not happen, as workqueues are not re-entrant.
> >
> > Looking at the code, nothing should be blocking in the isert ->done()
> > handlers, so it's not clear to me how this can happen.
> >
> > Would it be possible to run:
> > echo t > /proc/sysrq-trigger when this happens? I'd like to see where
> > that cq->work is blocking.
> >
> Attached file t_sysrq-trigger_and_dmesg.txt is the triggered output. Please let
> me know if it is timed correctly, as I triggered it a little after the login timeout.
> I'll try getting a better one meanwhile.
>
> > I'd also enable pr_debug on iscsi_target.c
> >
> Attached files are with debug enabled:
> tgt_discovery_and_login_dmesg.txt -> dmesg just after login, for reference
> tgt_IO_511MB_8target_1lun_each_iozone_dmesg_untill_hang.txt -> dmesg until the connection error.
>
> Please let me know if there is anything that I could check.

Hi Sagi, did you get a chance to check this? Thanks.

> >
> > >>> I tried changing the CQ poll workqueue to be UNBOUND, but it did not fix the issue.
> > >>>
> > >>> Here is what my test does:
> > >>> Create 8 targets with a 511MB LUN each, log in and format the disks to ext3, mount the
> > >>> disks, and run iozone over them.
> > >>> #iozone -a -I -+d -g 256m
> > >>
> > >> Does it happen specifically with iozone, or can dd/fio also
> > >> reproduce this issue? On which I/O pattern do you see the issue?
> > >>
> > > I see it with iozone. I am trying with fio and shall update soon.
> > > I see the issue at I/O sizes around the 128k/256k block sizes of iozone. It's not
> > > consistent.
> > >>> I am not sure how the LUN size could cause the connection errors. I appreciate any
> > >>> inputs on this.
> > >>
> > >> I imagine that a single LUN is enough to reproduce the issue?
> > >>
> > >
> > > Yes, attached is the target conf.
> > >> btw, I tried reproducing the issue with rxe (couldn't set up an iser
> > >> listener with siw) in 2 VMs on my laptop using LIO with a file backend, but
> > >> I cannot reproduce the issue..
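A short recap of the mechanism discussed above, for anyone picking this thread
up: with an IB_POLL_WORKQUEUE CQ, all completion processing for that CQ funnels
through a single work item. The sketch below paraphrases
drivers/infiniband/core/cq.c from memory, so names and details may differ from
the exact kernel in use here; it is only meant to illustrate the flow, not to
quote the code verbatim.

/* Completion interrupt callback for IB_POLL_WORKQUEUE CQs: just queue the work. */
static void ib_cq_completion_workqueue(struct ib_cq *cq, void *private)
{
	queue_work(cq->comp_wq, &cq->work);
}

/* The work function: poll a budget of CQEs and call each wc->wr_cqe->done(). */
static void ib_cq_poll_work(struct work_struct *work)
{
	struct ib_cq *cq = container_of(work, struct ib_cq, work);
	int completed;

	completed = __ib_process_cq(cq, IB_POLL_BUDGET_WORKQUEUE, cq->wc,
				    IB_POLL_BATCH);
	if (completed >= IB_POLL_BUDGET_WORKQUEUE ||
	    ib_req_notify_cq(cq, IB_POLL_FLAGS) > 0)
		/* more CQEs pending (or a missed event): queue ourselves again */
		queue_work(cq->comp_wq, &cq->work);
}

Because a work item never runs concurrently with itself, a single ->done()
callback that blocks inside ib_cq_poll_work() stalls all further completion
processing on that CQ, including the drain CQE, no matter how many times the
interrupt handler re-queues cq->work. That matches the symptom described above:
queue_work() is reached, but ib_cq_poll_work() never runs again.
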
> > > I see the issue quickly with 40G/25G links. I have not seen the issue on a 100G
> > > link. BTW, I am trying iWARP (T6/T5).
> > >
> > > Thanks for looking into it.
> >
> > From the log, it looks like the hang happens when the initiator tries to
> > log in again after the failure (the trace starts in iscsi_target_do_login), and
> > it looks like the target gave up on the login timeout. What is not
> > indicated is why the initiator got a ping timeout in the
> > first place...
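
To make the target-side hung task concrete, the wait that never returns is the
send-queue drain. The sketch below is a simplified paraphrase of __ib_drain_sq()
in drivers/infiniband/core/verbs.c; error handling and the IB_POLL_DIRECT case
are omitted, and the details may differ from the exact kernel in use here.

/* ->done() handler of the drain marker CQE: wake up the drainer. */
static void ib_drain_qp_done(struct ib_cq *cq, struct ib_wc *wc)
{
	struct ib_drain_cqe *cqe =
		container_of(wc->wr_cqe, struct ib_drain_cqe, cqe);

	complete(&cqe->done);
}

static void __ib_drain_sq(struct ib_qp *qp)
{
	struct ib_qp_attr attr = { .qp_state = IB_QPS_ERR };
	struct ib_drain_cqe sdrain;
	struct ib_rdma_wr swr = {};	/* marker WR, flushed with an error status */

	swr.wr.opcode = IB_WR_RDMA_WRITE;
	swr.wr.wr_cqe = &sdrain.cqe;
	sdrain.cqe.done = ib_drain_qp_done;
	init_completion(&sdrain.done);

	ib_modify_qp(qp, &attr, IB_QP_STATE);	/* move the QP to the error state */
	ib_post_send(qp, &swr.wr, NULL);	/* post the drain marker */

	/*
	 * For an IB_POLL_WORKQUEUE CQ this blocks until ib_cq_poll_work()
	 * polls the marker CQE and ib_drain_qp_done() calls complete().
	 * If the poll work never runs again, this waits forever, which is
	 * the hung task seen on the target during connection teardown.
	 */
	wait_for_completion(&sdrain.done);
}

If the teardown of the failed connection is stuck in this wait, that could also
be why the later login attempts hang until the login timeout, though, as noted
above, the original ping timeout on the initiator side is still unexplained.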