On 08/02/2016 11:29 PM, Doug Ledford wrote:
> On Tue, 2016-08-02 at 23:18 +0300, Nikolay Borisov wrote:
>> On Tue, Aug 2, 2016 at 10:21 PM, Doug Ledford <dledford@xxxxxxxxxx>
>> wrote:
>>>
>>> On Thu, 2016-07-21 at 10:31 +0300, Nikolay Borisov wrote:
>>>>
>>>> Hello,
>>>>
>>>> At the risk of sounding like a broken record, I came across another
>>>> case where ipoib can cause the machine to go haywire due to missed
>>>> join requests. This is on a 4.4.14 kernel. Here is what I believe
>>>> happens:
>>>
>>> [ snip long traces ]
>>>
>>>> This makes me wonder whether using timeouts is actually better than
>>>> blindly relying on the join completing.
>>>
>>> Blindly relying on the join completions is not what we do. We are
>>> very careful to make sure we always have the right locking so that
>>> we never leave a join request in the BUSY state without running the
>>> completion at some point. If you are seeing us do that, then it
>>> means we have a bug in our locking or state processing. The answer
>>> then is to find that bug, not to paper over it with a timeout. Can
>>> you find some way to reproduce this with a 4.7 kernel?
>>
>> Unfortunately my environment is constrained to the 4.4 kernel. I
>> will, however, try to get a couple of IB-enabled nodes on 4.7 and
>> see if anything shows up. While I don't have a 100% reproducer, I
>> see these symptoms fairly regularly on production nodes. I'm able
>> and happy to extract any runtime state that might be useful in
>> debugging this, i.e. I can obtain crashdumps and recover the state
>> of the ipoib stacks. I've seen this issue on 3.12 and on 4.4. Some
>> of my previous emails also show it manifesting as hangs in
>> cm_destroy_id. So there is clearly a problem there, but it has
>> proved very elusive.
>
> Can you give any clues as to what's causing it? Do you have link
> flaps? SM bounces? Lots of multicast joins/leaves?

Hello again,

After some testing and a lot more reboots, I think we've managed to
isolate a culprit. Based on data we've observed on the switches, it
seems that when a particular switch is congested it starts queuing
packets internally, and once its queue overflows it starts dropping
them. Our switches report discarding a lot of packets whenever we
increase the amount of traffic. Since our network is linear, i.e.
switch 1 -> switch 2 -> switch 3, a node on sw1 sending packets to a
node on sw2 can have those packets silently discarded when sw2 is
congested. This in turn causes ipoib (and the MAD drivers) to wait
for a response to a packet that never reached its destination. Does
that sound plausible?
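
For reference, here is a minimal sketch of the timeout approach being
debated above, built only on the stock kernel completion API. Everything
in it (the demo_join structure, the function names, the 1-second timeout)
is hypothetical illustration, not the actual ipoib join path, and it
deliberately sidesteps Doug's objection: a bounded wait guarantees forward
progress when a reply is silently dropped by a congested switch, but it
papers over, rather than fixes, a completion that is never run.

/*
 * Hypothetical sketch, not the real ipoib code: a join-style request
 * whose wait is bounded by wait_for_completion_timeout(), so a reply
 * lost to switch congestion cannot leave the requester BUSY forever.
 */
#include <linux/module.h>
#include <linux/completion.h>
#include <linux/jiffies.h>
#include <linux/errno.h>

struct demo_join {
	struct completion done;		/* fired by the reply path */
	bool busy;			/* analogue of the ipoib BUSY flag */
};

/* What a real reply handler (e.g. the MAD completion path) would do. */
static void __maybe_unused demo_join_complete(struct demo_join *join)
{
	join->busy = false;
	complete(&join->done);
}

/*
 * Send a request and wait at most timeout_ms for the reply.  Returns 0
 * on success, -ETIMEDOUT if the reply never arrived (for example,
 * because a congested switch discarded it).
 */
static int demo_join_wait(struct demo_join *join, unsigned int timeout_ms)
{
	init_completion(&join->done);
	join->busy = true;

	/* ... the join/MAD packet would be sent here ... */

	if (!wait_for_completion_timeout(&join->done,
					 msecs_to_jiffies(timeout_ms))) {
		/* No reply: clear BUSY so the state machine can retry. */
		join->busy = false;
		return -ETIMEDOUT;
	}
	return 0;
}

static struct demo_join demo;

static int __init demo_init(void)
{
	/* Nobody ever replies in this demo, so this times out after 1s. */
	pr_info("demo join returned %d (expect %d)\n",
		demo_join_wait(&demo, 1000), -ETIMEDOUT);
	return 0;
}

static void __exit demo_exit(void)
{
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");

The trade-off under discussion is exactly the one visible here: the
timeout converts an indefinite hang into a retryable error, at the cost
of masking whichever locking or state-machine bug left the request stuck
in BUSY in the first place.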