And after two days of sctpspray pounding, hit the bug.
Here's a partial write-up using examples from the core file:
=========================================================================
Fact 1: a renege operation will only be launched if there's a gap in
the received tsn sequence.
So a reassembly queue like this one would not be set upon:
PID 1784
sctp_association 0xffff88041b6a2000
tsn_map = 0xffff88041dd8d560,
base_tsn = 0x55751715,
cumulative_tsn_ack_point = 0x55751714,
max_tsn_seen = 0x55751714,
reasm queue summary:
ssn = 0x345, tsn = 0x5575170c, msg_flags = 0x2, rmem_len = 0x69c
ssn = 0x345, tsn = 0x5575170d, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x5575170e, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x5575170f, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x55751710, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x55751711, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x55751712, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x55751713, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x345, tsn = 0x55751714, msg_flags = 0x0, rmem_len = 0x69c
No gap: the last tsn in the reasm queue matches cumulative_tsn_ack_point = 0x55751714.
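To make Fact 1 concrete, here is a minimal sketch of the gap test applied
to the two associations from the core file. It assumes the test amounts to
comparing max_tsn_seen with cumulative_tsn_ack_point; has_gap() is a
hypothetical helper for illustration, not the kernel's code.

```python
# Field values copied from the two core-file dumps in this message.
def has_gap(cumulative_tsn_ack_point, max_tsn_seen):
    # Assumption: a gap exists whenever a TSN beyond the cumulative
    # ack point has already been seen.
    return max_tsn_seen != cumulative_tsn_ack_point

# sctp_association 0xffff88041b6a2000 (PID 1784): nothing to renege.
quiet = has_gap(0x55751714, 0x55751714)

# sctp_association 0xffff88041b845000: max_tsn_seen is ahead.
ours = has_gap(0x936a6d75, 0x936a6d79)

print(quiet, ours)
```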
In our case, I believe the reasm queue looked like this when the renege launched:
In sctp_association 0xffff88041b845000
base_tsn = 0x936a6d76,
cumulative_tsn_ack_point = 0x936a6d75,
max_tsn_seen = 0x936a6d79,
ssn = 0x0, tsn = 0x936a6d6f, msg_flags = 0x2, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d70, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d71, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d72, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d73, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d74, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d75, msg_flags = 0x0, rmem_len = 0x69c
ssn = 0x0, tsn = 0x936a6d78, msg_flags = 0x0, rmem_len = 0x628
The gap between tsn 0x936a6d75 and tsn 0x936a6d78 meant that a renege
could be launched.
It was launched for tsn = 0x936a6d76 (just arriving and apparently out
of memory), and the "needed" amount was 0x05ac (1452 bytes).
tsn = 0x936a6d78 was removed, and the amount recovered was 0x538 (1336 bytes).
The value of "rmem_len" in the event is not what is used to calculate the
needed and freed amounts.
Since 0x538 didn't satisfy 0x5ac, it went for the next one down on the
queue (tsn = 0x936a6d75) and recovered 0x5ac from it, for a total
recovery of 0xae4 (2788 bytes).
So because the first post-gap fragment happened to be a LAST_FRAG and shorter than
the rest of them, it wasn't enough to satisfy the request and we moved on
to the one that caused the BUG.
If there had been two gapped frags, or if the gapped frag had been another
middle one that was big enough to satisfy the request, it would not have
been caught freeing a fragment that was at the cumulative tsn ack point.
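The arithmetic above can be replayed with a small sketch. It assumes the
renege walks the reasm queue from the tail and stops as soon as the freed
total reaches the needed amount; renege() is a hypothetical stand-in, and
only the two fragments whose freed sizes are known from the core file are
modelled.

```python
NEEDED = 0x5ac          # bytes requested for the arriving tsn 0x936a6d76
CUM_ACK = 0x936a6d75    # cumulative_tsn_ack_point from the core file

# (tsn, bytes recovered when freed) for the tail of the reasm queue.
tail = [(0x936a6d75, 0x5ac), (0x936a6d78, 0x538)]

def renege(queue, needed):
    """Free fragments from the tail until 'needed' bytes are recovered."""
    freed = 0
    removed = []
    while queue and freed < needed:
        tsn, size = queue.pop()
        freed += size
        removed.append(tsn)
    return freed, removed

freed, removed = renege(list(tail), NEEDED)
print(hex(freed), [hex(t) for t in removed])
# The short post-gap fragment is not enough on its own, so the walk
# continues into the fragment sitting at the cumulative tsn ack point.
print(CUM_ACK in removed)
```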
========================================================================
Since the base_tsn and cumulative_tsn_ack_point are advanced in
sctp_ulpevent_make_rcvmsg() before putting the fragments on the
reasm queue, the renege code should not be allowed to dip below
that point in sctp_ulpq_renege_list(). Otherwise, you're
discarding undelivered data that you've already reported as
"delivered" to the sender, right?
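As a rough illustration of the guard I have in mind (a sketch only, not a
patch): the walk stops before freeing any fragment at or below the
cumulative tsn ack point. tsn_lte() and renege_guarded() are hypothetical
names, and the 32-bit serial-number comparison is my assumption about how
the tsn ordering should be tested.

```python
def tsn_lte(a, b):
    # 32-bit serial-number comparison, so the test survives tsn wraparound.
    return ((b - a) & 0xFFFFFFFF) < 0x80000000

def renege_guarded(queue, needed, cum_ack):
    """Free from the tail, but never at or below the cumulative ack point."""
    freed = 0
    removed = []
    while queue and freed < needed:
        tsn, size = queue[-1]
        if tsn_lte(tsn, cum_ack):
            break               # already reported to the sender as received
        queue.pop()
        freed += size
        removed.append(tsn)
    return freed, removed

# Same tail as in the core file: (tsn, bytes recovered when freed).
tail = [(0x936a6d75, 0x5ac), (0x936a6d78, 0x538)]
freed, removed = renege_guarded(list(tail), 0x5ac, 0x936a6d75)
print(hex(freed), [hex(t) for t in removed])
```

With this guard only tsn 0x936a6d78 is freed and the request comes up
0x74 bytes short, so the caller would have to cope with a partial renege.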
Thanks,
Bob Montgomery