Lots of conversation overnight! Let me see if I can answer. But before
I do, here's an observation that just jumped out at me.

We're pushing the network by issuing the command:

    dd if=/dev/zero of=~/foo count=1 bs=33554432

which writes data from the VPN client to my home directory (an NFS
file server on the other side of the VPN).

netstat -s before:

Ip:
    194773 total packets received
    0 forwarded
    0 incoming packets discarded
    194770 incoming packets delivered
    199630 requests sent out
    4238 outgoing packets dropped
    38 dropped because of missing route
Tcp:
    569 active connections openings
    4 passive connection openings
    25 failed connection attempts
    0 connection resets received
    8 connections established
    96254 segments received
    117889 segments send out
    4057 segments retransmited
    0 bad segments received.
    358 resets sent
Udp:
    97795 packets received
    10 packets to unknown port received.
    0 packet receive errors
    119674 packets sent
    0 receive buffer errors
    4238 send buffer errors

Then run the dd command (which takes about 17 seconds to complete
because writes are so slow). After:

Ip:
    214798 total packets received
    0 forwarded
    0 incoming packets discarded
    214795 incoming packets delivered
    237868 requests sent out
    5729 outgoing packets dropped
    38 dropped because of missing route
Tcp:
    572 active connections openings
    4 passive connection openings
    25 failed connection attempts
    0 connection resets received
    8 connections established
    106255 segments received
    143265 segments send out
    5547 segments retransmited
    0 bad segments received.
    364 resets sent
Udp:
    107804 packets received
    10 packets to unknown port received.
    0 packet receive errors
    145061 packets sent
    0 receive buffer errors
    5729 send buffer errors

The delta is:

    1491 outgoing IP packets dropped     (5729 - 4238)
    1490 TCP segments retransmitted      (5547 - 4057)
    1491 UDP send buffer errors          (5729 - 4238)

So to me that reads as: 1491 outbound UDP packets suffered "send
buffer errors", which resulted in 1491 IP packets being dropped. That,
in turn, yielded 1490 TCP segments (which were inside IPsec UDP
packets) being lost, and thus retransmitted.

This REALLY smells funny. On a lark, I tried setting
net.core.wmem_max, net.core.wmem_default, net.ipv4.udp_wmem_min, etc.
to much larger values, with no measurable results.

Now, as to your questions/observations:

>> But a reconnect/rekey would be clearly visible in OpenConnect output.
>> Tony, presumably you'd have seen that and mentioned it?

The openconnect output is very quiet once the tunnel is established.
We reboot the VMs every 24 hours, and during that whole 24 hours, this
is the last thing in the log:

    Tunnel timeout (rekey interval) is 2880 minutes.
    (blah blah)
    ESP tunnel connected; exiting HTTPS mainloop.

... and that's seen at bootup. So since 2880 minutes (48 hours)
exceeds the life of the VM, there's never going to be a rekey at all.

>> I think Tony said it was the TCP sequence number, no? That part is
>> mystifying. Unless/until the gateway has successfully decrypted the
>> ESP packet, it should have no idea about the TCP seqno, right?

Yes, it is the TCP sequence number seen in the retransmitted,
un-VPN'd traffic. When we tcpdump the iperf flows, we can see that the
original packet carrying a given TCP sequence number is lost, because
a new packet with that sequence number arrives much later than it
should.
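For reference on those counters: the Udp "send buffer errors" line
counts sends that failed with ENOBUFS, i.e. the kernel could not
queue the outgoing datagram, which would line up with the one-for-one
IP drops above. One possible reason the sysctl experiment showed no
effect is that net.core.wmem_max is only a ceiling on what an
application may request via setsockopt(SO_SNDBUF); a process that
never makes that call stays at net.core.wmem_default regardless. A
minimal standalone sketch of that interaction (not openconnect code;
the 8 MB request is an arbitrary illustration):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int val = 0;
        socklen_t len = sizeof(val);

        /* Whatever net.core.wmem_default gave this socket. */
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
        printf("initial SO_SNDBUF:   %d bytes\n", val);

        /* Ask for 8 MB; the kernel doubles the request (to cover
         * bookkeeping overhead) and silently caps the result at
         * net.core.wmem_max. */
        val = 8 * 1024 * 1024;
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));

        len = sizeof(val);
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
        printf("effective SO_SNDBUF: %d bytes\n", val);

        close(fd);
        return 0;
}

Comparing the two printed values against the wmem sysctls shows
whether a larger buffer is actually taking effect for a given socket.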
-----Original Message-----
From: Daniel Lenski [mailto:dlenski@xxxxxxxxx]
Sent: Monday, March 25, 2019 5:14 AM
To: David Woodhouse
Cc: Phillips, Tony; Nikos Mavrogiannopoulos; openconnect-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [EXTERNAL] Re: What throughput is reasonable?

On Mon, Mar 25, 2019 at 11:54 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Mon, 2019-03-25 at 11:41 +0200, Daniel Lenski wrote:
> > On Mon, Mar 25, 2019 at 10:29 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, 2019-03-24 at 19:13 +0200, Daniel Lenski wrote:
> > > >
> > > > Do I have this right? High packet loss from client→VPN, low
> > > > packet loss from VPN→client?
> > > >
> > > > If so, I'm guessing your problems are MTU-related.
> > >
> > > Hm, wouldn't we expect that to be more consistent? If the full-sized
> > > packets are getting lost, that would just stall and not lose the
> > > *occasional* packet?
> >
> > Yeah… should be. My guess is based on a couple of previous
> > less-detailed reports from users of earlier versions with GP.
> >
> > > If it really is a repeatable drop every N packets, I might be inclined
> > > to look at sequence numbers and epoch handling. Are we doing any ESP
> > > rekeying?
> >
> > We are rekeying, but only using the most naïve "tunnel rekey" method.
> > AFAIK, that's all that GP supports.
> > https://gitlab.com/openconnect/openconnect/blob/v8.02/gpst.c#L1153-1157
> >
> > After a certain time has elapsed, we tear down the TLS connection and
> > reconnect (using the same auth cookie), which also invalidates the
> > previous ESP keys and requires us to start using new ones. We should
> > handle late incoming packets using the "old" ESP keys correctly, using
> > the same method as with Juniper.
>
> We might be handling late incoming packets correctly, but we stop
> actually sending them. I wonder if we should continue to send ESP
> packets on the "old" connection even while we're doing the reconnect?

Hmmm… I don't think so. I did a whole lot of testing of the
"tap-dance" required to enable the ESP tunnel early on, and as far as
I can tell there are two main points:

1) As soon as the client config request (POST /ssl-vpn/getconfig.esp)
   is received, any pre-existing ESP keys become invalid immediately,
   and new ESP keys become valid immediately.

2) If the client ever connects to the TLS tunnel (bogus pseudo-CONNECT
   GET request to /ssl-tunnel-connect.sslvpn), the existing ESP keys
   immediately become invalid.

Given this, I believe the safe behavior is to disable the UDP
connection entirely before the reconnect starts, and just let the
outgoing packet queue grow. Currently, we're actually *not* disabling
the UDP connection before starting the reconnect
(https://gitlab.com/openconnect/openconnect/blob/v8.02/gpst.c#L1162-1171),
but we probably should be. Maybe try this patch…?

diff --git a/gpst.c b/gpst.c
index a0dc81f..5cd1aab 100644
--- a/gpst.c
+++ b/gpst.c
@@ -1160,6 +1160,8 @@ int gpst_mainloop(struct openconnect_info *vpninfo, int *timeout)
                         vpn_progress(vpninfo, PRG_ERR,
                                      _("GPST Dead Peer Detection detected dead peer!\n"));
         do_reconnect:
+                if (vpninfo->proto->udp_close)
+                        vpninfo->proto->udp_close(vpninfo);
                 ret = ssl_reconnect(vpninfo);
                 if (ret) {
                         vpn_progress(vpninfo, PRG_ERR, _("Reconnect failed\n"));

> But a reconnect/rekey would be clearly visible in OpenConnect output.
> Tony, presumably you'd have seen that and mentioned it?

Yeah, there should be a PRG_INFO message on initial connection and
reconnection: "Tunnel timeout (rekey interval) is %d minutes."
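For context on the patch: the NULL check follows from OpenConnect's
per-protocol hook table, where a protocol with no UDP transport
presumably leaves udp_close unset. A minimal sketch of that idiom,
using an illustrative struct rather than openconnect's actual
vpn_proto definition:

#include <stdio.h>

struct openconnect_info;  /* opaque in this sketch */

struct vpn_proto {
        const char *name;
        /* NULL when the protocol has no UDP/ESP transport. */
        void (*udp_close)(struct openconnect_info *vpninfo);
};

static void gpst_udp_close(struct openconnect_info *vpninfo)
{
        (void)vpninfo;
        printf("ESP/UDP transport closed; outgoing packets queue "
               "until the TLS reconnect finishes\n");
}

/* Same shape as the patched do_reconnect path: close UDP first
 * (if the protocol supports it), then reconnect TLS. */
static void do_reconnect(const struct vpn_proto *proto)
{
        if (proto->udp_close)
                proto->udp_close(NULL);
        printf("%s: reconnecting TLS mainloop...\n", proto->name);
}

int main(void)
{
        const struct vpn_proto gp       = { "gp", gpst_udp_close };
        const struct vpn_proto tls_only = { "tls-only", NULL };

        do_reconnect(&gp);
        do_reconnect(&tls_only);
        return 0;
}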
Anyway, I kind of doubt reconnect/rekey is playing a role here… all
the real GP VPNs I've heard about have rekey intervals of at least 20
minutes.

> Also, you said that you hit this a repeatable 4142 packets into a TCP
> connection? That was regardless of how long the VPN had been up?

I think Tony said it was the TCP sequence number, no? That part is
mystifying. Unless/until the gateway has successfully decrypted the
ESP packet, it should have no idea about the TCP seqno, right?

Dan

_______________________________________________
openconnect-devel mailing list
openconnect-devel@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/openconnect-devel
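As a closing illustration of the observation that started this
subthread: in a capture, a retransmission shows up as the same TCP
sequence number reappearing long after its neighbors. A toy sketch of
that check, with invented (time, seq) pairs standing in for values
parsed out of tcpdump output:

#include <stdio.h>
#include <stdint.h>

struct pkt { double t; uint32_t seq; };

int main(void)
{
        /* Hypothetical excerpt: seq 14481 is sent at t=0.020s, lost
         * inside the tunnel, and retransmitted at t=0.250s. */
        struct pkt pkts[] = {
                { 0.010, 13033 }, { 0.020, 14481 }, { 0.030, 15929 },
                { 0.040, 17377 }, { 0.250, 14481 },
        };
        size_t n = sizeof(pkts) / sizeof(pkts[0]);

        for (size_t i = 1; i < n; i++)
                for (size_t j = 0; j < i; j++)
                        if (pkts[i].seq == pkts[j].seq)
                                printf("seq %u seen again %.0f ms later "
                                       "-> original was lost\n",
                                       pkts[i].seq,
                                       (pkts[i].t - pkts[j].t) * 1000);
        return 0;
}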