On Thu, 2013-01-31 at 10:08 -0500, Vlad Yasevich wrote:
> On 01/30/2013 11:30 PM, Roberts, Lee A. wrote:
> > Vlad,
> >
> > The test code that I'm running at the moment has changes similar to the following.
> > I think we want to peek at the tail of the queue---and not dequeue (or unlink) the
> > data until we're sure we want to renege.
>
> You are right.  If Bob can send a signed-off patch linux-sctp and
> netdev, we can get it upstream and into stable releases.
>
> -vlad

Vlad,

This is just one of many things we suspect, and doesn't explain (or fix)
the hang we're looking at.  Lee and I are working on a list of problems
around renege, tsnmap management, reassembly, and partial delivery mode.

Here's a current favorite potential issue (documented by Lee):

In sctp_ulpq_renege():

	/* If able to free enough room, accept this chunk. */
	if (chunk && (freed >= needed)) {
		__u32 tsn;
		tsn = ntohl(chunk->subh.data_hdr->tsn);
		sctp_tsnmap_mark(&asoc->peer.tsn_map, tsn);
		sctp_ulpq_tail_data(ulpq, chunk, gfp);

		sctp_ulpq_partial_delivery(ulpq, chunk, gfp);
	}

sctp_tsnmap_mark() is called *before* sctp_ulpq_tail_data(), and the
return value of sctp_ulpq_tail_data() is not checked.  But
sctp_ulpq_tail_data() can fail to allocate memory and return -ENOMEM.
So potentially we've marked this tsn as present and then failed to
actually keep it, right?

Here's another potential issue:

An event in the lobby has a single tsn value, but it might have been
reassembled from several fragments (with sequential tsn's).  The
renege_list operation only calls sctp_tsnmap_renege() with that single
tsn.  So now I've discarded multiple tsn's worth of data, but only noted
one of them in the map, right??

And another:

Under normal operation, an event that fills a hole in the lobby will
result in a list of events (the new one and the sequential ones that had
been waiting in the lobby) being sent to sctp_ulpq_tail_event().
Then we do this:

	/* Check if the user wishes to receive this event.  */
	if (!sctp_ulpevent_is_enabled(event, &sctp_sk(sk)->subscribe))
		goto out_free;

In out_free, we do

	sctp_queue_purge_ulpevents(skb_list);

So if the first event was a notification that we don't subscribe to,
but the remaining 100 were data, do we really throw out all the other
data with it??

These don't explain my favorite hang either, but I think I'm finally
getting close to that problem.  We uncovered these things while trying
to understand this code.  That, plus the fact that we're not testing and
debugging on the current kernel, is why we're not sending in any patches
yet.

Thanks for any confirmation or insight you can provide :-)

Bob Montgomery
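
To make the first concern (marking the tsn before sctp_ulpq_tail_data()
is known to have succeeded) concrete, here is a minimal user-space
sketch.  It is not kernel code and not a proposed patch: the toy_*
types and the accept_* functions are hypothetical stand-ins invented for
illustration, and the only behavior assumed is that the enqueue step can
return -ENOMEM, as described above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are NOT the real SCTP structures. */
#define TOY_MAP_SIZE 64

struct toy_tsnmap {
	bool seen[TOY_MAP_SIZE];	/* seen[tsn] is true once a tsn is marked */
};

struct toy_ulpq {
	int free_slots;			/* models memory left for new events */
	int queued;			/* number of chunks actually kept */
};

/* Models the enqueue step: fails with -ENOMEM when no room is left. */
static int toy_tail_data(struct toy_ulpq *q)
{
	if (q->free_slots <= 0)
		return -ENOMEM;
	q->free_slots--;
	q->queued++;
	return 0;
}

/* Models marking a tsn as received in the map. */
static void toy_mark(struct toy_tsnmap *map, uint32_t tsn)
{
	assert(tsn < TOY_MAP_SIZE);
	map->seen[tsn] = true;
}

/* Same ordering as the quoted snippet: mark first, then try to enqueue. */
static int accept_mark_first(struct toy_tsnmap *map, struct toy_ulpq *q,
			     uint32_t tsn)
{
	toy_mark(map, tsn);
	return toy_tail_data(q);
}

/* One possible alternative ordering: mark only if the chunk was kept. */
static int accept_mark_after(struct toy_tsnmap *map, struct toy_ulpq *q,
			     uint32_t tsn)
{
	int err = toy_tail_data(q);

	if (err < 0)
		return err;
	toy_mark(map, tsn);
	return 0;
}
```

With free_slots set to 0, accept_mark_first() returns -ENOMEM but leaves
the tsn marked in the map with nothing queued, which is the mismatch
described above.  accept_mark_after() leaves the map untouched in the
same situation.  Whether the real fix should reorder the calls or undo
the mark on failure is left open here.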