On Tue, Jun 22, 2021 at 9:09 PM Xin Long <lucien.xin@xxxxxxxxx> wrote:
>
> On Tue, Jun 22, 2021 at 6:13 PM David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > From: Xin Long
> > > Sent: 22 June 2021 19:05
> > >
> > > Overview (from RFC 8899):
> > >
> > > In contrast to PMTUD, Packetization Layer Path MTU Discovery
> > > (PLPMTUD) [RFC4821] introduces a method that does not rely upon
> > > reception and validation of PTB messages.  It is therefore more
> > > robust than Classical PMTUD.  This has become the recommended
> > > approach for implementing discovery of the PMTU [BCP145].
> > >
> > > It uses a general strategy in which the PL sends probe packets to
> > > search for the largest size of unfragmented datagram that can be sent
> > > over a network path.  Probe packets are sent to explore using a
> > > larger packet size.  If a probe packet is successfully delivered (as
> > > determined by the PL), then the PLPMTU is raised to the size of the
> > > successful probe.  If a black hole is detected (e.g., where packets
> > > of size PLPMTU are consistently not received), the method reduces the
> > > PLPMTU.
> >
> > This seems to take a long time (probably well over a minute)
> > to determine the mtu.
> I just noticed this is a misreading of RFC 8899: the next probe packet
> should be sent immediately once the ACK of the last probe is received,
> instead of waiting for the timeout, which should only be used for a
> missing probe.
>
> I will fix this with:
>
> diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
> index d29b579da904..f3aca1acf93a 100644
> --- a/net/sctp/sm_statefuns.c
> +++ b/net/sctp/sm_statefuns.c
> @@ -1275,6 +1275,8 @@ enum sctp_disposition sctp_sf_backbeat_8_3(struct net *net,
>                         return SCTP_DISPOSITION_DISCARD;
>
>                 sctp_transport_pl_recv(link);
> +               sctp_add_cmd_sf(commands, SCTP_CMD_PROBE_TIMER_UPDATE,
> +                               SCTP_TRANSPORT(link));
>                 return SCTP_DISPOSITION_CONSUME;
>         }
>
> diff --git a/net/sctp/transport.c b/net/sctp/transport.c
> index f27b856ea8ce..88815b98d9d0 100644
> --- a/net/sctp/transport.c
> +++ b/net/sctp/transport.c
> @@ -215,6 +215,11 @@ void sctp_transport_reset_probe_timer(struct sctp_transport *transport)
>  {
>         int scale = 1;
>
> +       if (transport->pl.probe_count == 0) {
> +               if (!mod_timer(&transport->probe_timer, jiffies + transport->rto))
> +                       sctp_transport_hold(transport);
> +               return;
> +       }
>         if (timer_pending(&transport->probe_timer))
>                 return;
>         if (transport->pl.state == SCTP_PL_COMPLETE &&
>
> Thanks for the comment.
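
To put a rough number on what sending the next probe on the ACK (rather than
on the probe timer) buys, here is a small userspace sketch. This is not the
kernel code; the 1200-byte start, the 32-byte step, the 1460-byte path and
the 5s interval are simply the values visible in the trace further down:

/* Rough userspace sketch (not the kernel code): how many probes the
 * SEARCH walk needs to climb from the 1200-byte base to a 1460-byte
 * path in 32-byte steps, and what that costs if every step waits for
 * the probe timer instead of going out on the previous probe's ACK. */
#include <stdio.h>

int main(void)
{
        int pmtu = 1460;        /* path MTU eventually found in the trace */
        int size = 1200;        /* base size the search starts from       */
        int step = 32;          /* step size seen in the trace            */
        int probes = 0;

        while (size + step <= pmtu) {   /* each of these probes is ACKed */
                size += step;
                probes++;
        }

        printf("%d successful probes to reach %d bytes\n", probes, size);
        printf("probe on timer: ~%d s, probe on ACK: ~%d round trips\n",
               probes * 5, probes);
        return 0;
}

That is why all of the [103] entries from [a] to [b] below land in the
same second.
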
A more efficient improvement:

diff --git a/net/sctp/sm_statefuns.c b/net/sctp/sm_statefuns.c
index d29b579da904..d5cb0124bafa 100644
--- a/net/sctp/sm_statefuns.c
+++ b/net/sctp/sm_statefuns.c
@@ -1275,7 +1275,13 @@ enum sctp_disposition sctp_sf_backbeat_8_3(struct net *net,
                        return SCTP_DISPOSITION_DISCARD;

                sctp_transport_pl_recv(link);
-               return SCTP_DISPOSITION_CONSUME;
+
+               if (link->pl.state == SCTP_PL_COMPLETE) {
+                       sctp_add_cmd_sf(commands, SCTP_CMD_PROBE_TIMER_UPDATE,
+                                       SCTP_TRANSPORT(link));
+                       return SCTP_DISPOSITION_CONSUME;
+               }
+               return sctp_sf_send_probe(net, ep, asoc, type, link, commands);
        }

        max_interval = link->hbinterval + link->rto;

diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index f27b856ea8ce..37f65f617f05 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -215,10 +215,8 @@ void sctp_transport_reset_probe_timer(struct sctp_transport *transport)
 {
        int scale = 1;

-       if (timer_pending(&transport->probe_timer))
-               return;
        if (transport->pl.state == SCTP_PL_COMPLETE &&
-           transport->pl.probe_count == 1)
+           transport->pl.probe_count == 0)
                scale = 30; /* works as PMTU_RAISE_TIMER */
        if (!mod_timer(&transport->probe_timer,
                       jiffies + transport->probe_interval * scale))

[103] pl_send: PLPMTUD: state: 1, size: 1200, high: 0     <--[a]
[103] pl_recv: PLPMTUD: state: 1, size: 1200, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1232, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1232, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1264, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1264, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1296, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1296, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1328, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1328, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1360, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1360, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1392, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1392, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1424, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1424, high: 0
[103] pl_send: PLPMTUD: state: 2, size: 1456, high: 0
[103] pl_recv: PLPMTUD: state: 2, size: 1456, high: 0     <--[b]
[103] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
[108] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
[113] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
[118] pl_send: PLPMTUD: state: 2, size: 1488, high: 0
[118] pl_recv: PLPMTUD: state: 2, size: 1456, high: 1488  <--[c]
[118] pl_send: PLPMTUD: state: 2, size: 1460, high: 1488
[118] pl_recv: PLPMTUD: state: 2, size: 1460, high: 1488  <--[d]
[118] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
[124] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
[129] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
[134] pl_send: PLPMTUD: state: 2, size: 1464, high: 1488
[134] pl_recv: PLPMTUD: state: 2, size: 1460, high: 1464  <-- around 30s, "search complete" from 1200 bytes

[287] pl_send: PLPMTUD: state: 3, size: 1460, high: 0
[287] pl_recv: PLPMTUD: state: 3, size: 1460, high: 0
[287] pl_send: PLPMTUD: state: 2, size: 1464, high: 0     <--[aa]
[292] pl_send: PLPMTUD: state: 2, size: 1464, high: 0
[298] pl_send: PLPMTUD: state: 2, size: 1464, high: 0
[303] pl_send: PLPMTUD: state: 2, size: 1464, high: 0
[303] pl_recv: PLPMTUD: state: 2, size: 1460, high: 1464  <--[bb], around 15s, "re-search complete" from the current pmtu
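
For reading the trace, this is how I map the "state:" values to the PL
states (RFC 8899 naming; the numbering below is an assumption based on the
trace, not copied from the patch):

/* Assumed mapping of the "state:" values printed above to the PL
 * states of this series (RFC 8899 naming); inferred from the trace. */
enum sctp_plpmtud_state {
        SCTP_PL_DISABLED = 0,   /* PLPMTUD not enabled on the transport */
        SCTP_PL_BASE     = 1,   /* probing the 1200-byte base size      */
        SCTP_PL_SEARCH   = 2,   /* walking the probe size upwards       */
        SCTP_PL_COMPLETE = 3,   /* search done, raise timer (scale 30)  */
        SCTP_PL_ERROR    = 4,   /* the base probe itself failed         */
};

So [a] is the base probe, [b] is the last ACKed step of the first search,
and the state-3 entries at [287] look like the raise-timer probe of the
current pmtu, after which the search re-opens at [aa].
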
So, since there is no interval before sending the next probe once the ACK
for the previous one arrives, the walks from [a] to [b] and from [c] to [d]
take almost no time. At most 2 probe sizes fail on the way to the right
pmtu, and each failure costs 5s * 3 = 15s (three unanswered probes at the
5s probe interval), which is the "around 30s" above. When the search
re-opens from "search complete" after the long raise timeout
(probe_interval * 30), only 1 failure is needed to reconfirm the pmtu,
from [aa] to [bb], i.e. around 15s.

Thanks.

> >
> > What is used for the actual mtu while this is in progress?
> >
> > Does packet loss and packet retransmission cause the mtu
> > to be reduced as well?
> No, the data packet is not a probe in this implementation.
>
> > I can imagine that there is an expectation (from the application)
> > that the mtu is that of an ethernet link - perhaps less a PPPoE
> > header.
> > Starting with an mtu of 1200 will break this assumption and may
> > have odd side effects.
> The search starts from an mtu of 1200, but the real pmtu is only updated
> when the search is done and the optimal mtu has been found.
> So at the beginning, it will still use the dst mtu as before.
>
> > For TCP/UDP the ICMP segmentation required error is immediate
> > and gets used for the retransmissions.
> > This code seems to be looking at separate timeouts - so a lot of
> > packets could get discarded and application timers expire before
> > it determines the correct mtu.
> This patch also processes the ICMP error msg and takes the 'mtu' size
> from it, but before using it, it will verify (probe) it first:
>
> see the patch: "sctp: do state transition when receiving an icmp TOOBIG packet"
>
> >
> > Maybe I missed something about this only being done on inactive
> > paths?
> >
> > David
> >
> > -
> > Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> > Registration No: 1397386 (Wales)
> >
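
One more note on the ICMP TOOBIG handling mentioned above: the 'mtu' reported
by a PTB message is only taken as a hint for the next probe size and has to
be confirmed by an ACKed probe before the PLPMTU changes. A rough sketch of
that idea (illustrative only; the names and types here are made up and this
is not the code from the TOOBIG patch):

/* Illustrative only: the names and types are made up, this is not the
 * code from the "icmp TOOBIG" patch.  A PTB-reported MTU never sets
 * the PLPMTU directly; it only shrinks the next probe, and the PLPMTU
 * is updated once a probe of that size is ACKed. */
enum pl_state { PL_BASE, PL_SEARCH, PL_COMPLETE, PL_ERROR };

struct pl {
        enum pl_state state;
        unsigned int pmtu;              /* only updated from ACKed probes */
        unsigned int probe_size;        /* size of the next probe to send */
        unsigned int probe_high;        /* smallest size known to be lost */
};

static void pl_on_ptb(struct pl *pl, unsigned int ptb_mtu)
{
        /* Ignore reports below the 1200-byte base or not smaller than
         * what is already being probed: they carry no new information. */
        if (ptb_mtu < 1200 || ptb_mtu >= pl->probe_size)
                return;

        pl->probe_high = pl->probe_size;
        pl->probe_size = ptb_mtu;       /* verify this size by probing */
        pl->state = PL_SEARCH;
}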