On Mon, Nov 26, 2012 at 01:31:57PM +0000, Jamie Parsons wrote:
> Hi,
>
> My name is Jamie Parsons. I am working on a test tool that uses lksctp (lksctp-tools.x86_64 on a Linux box with a 2.6.32-279.9.1.el6.x86_64 kernel) to drive the SCTP interface on one of our products and I think I may have found a bug with the peer receive window size.
>
> Having looked at this kernel maintainers list (http://lxr.linux.no/#linux+v3.6.7/MAINTAINERS) I believe that you are the people I should contact to report a bug. If not, please let me know who I should be talking to instead.
>
> If you are the correct people, can you please look at the detailed description below? I think that the issue may be some problem to do with data structures not being reinitialized correctly after receiving an unexpected INIT.
>
> I've had a quick look at recent check-ins for the kernel and couldn't see anything which was obviously a fix for this bug. Would you be able to help debug/fix this issue? I'm happy to repro it to get any diagnostics required.
>
> Thanks for your help,
>
> Jamie
>
> ==========================================
>
> __TEST SETUP__
> I've set up an SCTP connection between a Linux box and a fault-tolerant peer. I collect a wireshark trace from the Linux box throughout the test and periodically poll the Linux kernel for SCTP_STATUS using getsockopt().
>
> After letting it run cleanly for a few minutes, I then deliberately induce a fault on the peer to make it fail over. The peer then restarts the connection by sending an INIT to the Linux box (as covered by section 5.2.2 of RFC 4960).
>
> __SYMPTOMS__
> Initially, the peer is advertising a receive window of 2000 (I check this by looking at sctp.sack_a_rwnd in wireshark). I can check that the Linux SCTP stack agrees with this value by doing a getsockopt for SCTP_STATUS and checking the value of sstat_rwnd.
> At this stage there is no problem; the SCTP stack reports a value of 2000 with a slight deviation if there is some unacked data outstanding. All good so far!
>
> After failing over, the wireshark trace still shows that the peer is advertising a receive window of 2000. However, if I now check the peer receive window through the Linux SCTP stack as above, it reports a consistently lower value (916 in my last run), again with slight deviation if there are unacked packets.
>
> At this point I stop sending any data from the Linux box, and wait a couple of minutes to ensure that the send buffer is emptied and all packets have been acked. The last SACK sent by the peer has a receive window value of 2000 but the SCTP stack is still reporting a value of 916 with no packets unacked.
>
> The problem is compounded by the fact that the SCTP association now can't be brought down from the Linux side. I have set SO_LINGER 'on' with a time of 0. If I call shutdown(SCK, SHUT_RDWR) before a failover then I can see the Linux box send an ABORT in the wireshark trace to tear down the association.
>
> If I call shutdown(SCK, SHUT_RDWR) after the peer has failed over then no ABORT message is sent. Using getsockopt with SCTP_STATUS I can see that the stack is in state 5 (PENDING_SHUTDOWN) and stays there indefinitely, which means it is waiting for packets to be acked. This is despite the fact that it reports a value of 0 for unacked packets.
>

Can you provide a diagram of your network setup, a link to someplace I can see your tcpdump, and the specific kernel version that you're using?

Neil

>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
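
For anyone trying to reproduce the teardown part of the report, the linger setup it describes (SO_LINGER 'on' with a time of 0) can be sketched roughly as below. This is only a minimal sketch: the test tool's real code is not shown in the thread, and the helper names set_zero_linger() and has_zero_linger() are hypothetical. It uses only the standard sockets API, so it applies to whatever descriptor the tool opened for the association.

```c
#include <sys/socket.h>

/* Hypothetical helper: switch SO_LINGER on with a timeout of 0 seconds,
 * matching the setting described in the report.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_zero_linger(int fd)
{
    struct linger lg;

    lg.l_onoff = 1;   /* linger 'on' */
    lg.l_linger = 0;  /* time of 0 */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/* Hypothetical helper: read the option back to confirm what the kernel holds.
 * Returns 1 if linger is on with a 0 timeout, 0 if not, -1 on error. */
int has_zero_linger(int fd)
{
    struct linger lg;
    socklen_t len = sizeof(lg);

    if (getsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, &len) < 0)
        return -1;
    return lg.l_onoff != 0 && lg.l_linger == 0;
}
```

Calling set_zero_linger(fd) once after socket() and checking has_zero_linger(fd) both before and after the failover would at least confirm that the option itself is unchanged when the association gets stuck.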