Lots of conversation overnight! Let me see if I can answer. But before
I do, here's an observation that just jumped out at me.

We're pushing the network by issuing the command:

    dd if=/dev/zero of=~/foo count=1 bs=33554432

which writes data from the VPN client to my home directory (an NFS
file server on the other side of the VPN).

netstat -s before:

Ip:
    194773 total packets received
    0 forwarded
    0 incoming packets discarded
    194770 incoming packets delivered
    199630 requests sent out
    4238 outgoing packets dropped
    38 dropped because of missing route
Tcp:
    569 active connections openings
    4 passive connection openings
    25 failed connection attempts
    0 connection resets received
    8 connections established
    96254 segments received
    117889 segments send out
    4057 segments retransmited
    0 bad segments received.
    358 resets sent
Udp:
    97795 packets received
    10 packets to unknown port received.
    0 packet receive errors
    119674 packets sent
    0 receive buffer errors
    4238 send buffer errors

Then run the dd command (which takes about 17 seconds to complete
because writes are so slow). After:

Ip:
    214798 total packets received
    0 forwarded
    0 incoming packets discarded
    214795 incoming packets delivered
    237868 requests sent out
    5729 outgoing packets dropped
    38 dropped because of missing route
Tcp:
    572 active connections openings
    4 passive connection openings
    25 failed connection attempts
    0 connection resets received
    8 connections established
    106255 segments received
    143265 segments send out
    5547 segments retransmited
    0 bad segments received.
    364 resets sent
Udp:
    107804 packets received
    10 packets to unknown port received.
    0 packet receive errors
    145061 packets sent
    0 receive buffer errors
    5729 send buffer errors

The delta is:

    1491 outgoing IP packets dropped     (5729 - 4238)
    1490 TCP segments retransmitted      (5547 - 4057)
    1491 UDP send buffer errors          (5729 - 4238)

So to me that reads as: 1491 outbound UDP packets suffered "send
buffer errors", which resulted in 1491 IP packets being dropped. That,
in turn, yielded 1490 TCP segments (which were inside IPsec UDP
packets) being lost, and thus retransmitted.

This REALLY smells funny. On a lark, I tried setting
net.core.wmem_max, net.core.wmem_default, net.ipv4.udp_wmem_min, etc.
to much larger values, with no measurable results.

Now, as to your questions/observations:

>> But a reconnect/rekey would be clearly visible in OpenConnect output.
>> Tony, presumably you'd have seen that and mentioned it?

The openconnect output is very quiet once the tunnel is established.
We reboot the VMs every 24 hours, and during that whole 24 hours, this
is the last thing in the log:

    Tunnel timeout (rekey interval) is 2880 minutes.
    (blah blah)
    ESP tunnel connected; exiting HTTPS mainloop.

... and that's seen at bootup. So since 2880 minutes (48 hours)
exceeds the life of the VM, there's never going to be a rekey at all.

>> I think Tony said it was the TCP sequence number, no? That part is
>> mystifying. Unless/until the gateway has successfully decrypted the
>> ESP packet, it should have no idea about the TCP seqno, right?

Yes, it is the TCP sequence number seen in the retransmitted,
un-VPN'd traffic. When we tcpdump the iperf flows, we can see that the
original packet carrying a given TCP sequence number is lost, because
a new packet with that sequence number arrives much later than it
should.
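For reference on those counters: the Udp "send buffer errors" line
counts sends that failed with ENOBUFS, i.e. the kernel could not
queue the outgoing datagram, which would line up with the one-for-one
IP drops above. One possible reason the sysctl experiment showed no
effect is that net.core.wmem_max is only a ceiling on what an
application may request via setsockopt(SO_SNDBUF); a process that
never makes that call stays at net.core.wmem_default regardless. A
minimal standalone sketch of that interaction (not openconnect code;
the 8 MB request is an arbitrary illustration):

#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void)
{
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int val = 0;
        socklen_t len = sizeof(val);

        /* Whatever net.core.wmem_default gave this socket. */
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
        printf("initial SO_SNDBUF:   %d bytes\n", val);

        /* Ask for 8 MB; the kernel doubles the request (to cover
         * bookkeeping overhead) and silently caps the result at
         * net.core.wmem_max. */
        val = 8 * 1024 * 1024;
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));

        len = sizeof(val);
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len);
        printf("effective SO_SNDBUF: %d bytes\n", val);

        close(fd);
        return 0;
}

Comparing the two printed values against the wmem sysctls shows
whether a larger buffer is actually taking effect for a given socket.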
-----Original Message-----
From: Daniel Lenski [mailto:dlenski@xxxxxxxxx]
Sent: Monday, March 25, 2019 5:14 AM
To: David Woodhouse
Cc: Phillips, Tony; Nikos Mavrogiannopoulos; openconnect-devel@xxxxxxxxxxxxxxxxxxx
Subject: Re: [EXTERNAL] Re: What throughput is reasonable?

On Mon, Mar 25, 2019 at 11:54 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
>
> On Mon, 2019-03-25 at 11:41 +0200, Daniel Lenski wrote:
> > On Mon, Mar 25, 2019 at 10:29 AM David Woodhouse <dwmw2@xxxxxxxxxxxxx> wrote:
> > >
> > > On Sun, 2019-03-24 at 19:13 +0200, Daniel Lenski wrote:
> > > >
> > > > Do I have this right? High packet loss from client→VPN, low
> > > > packet loss from VPN→client?
> > > >
> > > > If so, I'm guessing your problems are MTU-related.
> > >
> > > Hm, wouldn't we expect that to be more consistent? If the full-sized
> > > packets are getting lost, that would just stall and not lose the
> > > *occasional* packet?
> >
> > Yeah… should be. My guess is based on a couple of previous
> > less-detailed reports from users of earlier versions with GP.
> >
> > > If it really is a repeatable drop every N packets, I might be inclined
> > > to look at sequence numbers and epoch handling. Are we doing any ESP
> > > rekeying?
> >
> > We are rekeying, but only using the most naïve "tunnel rekey" method.
> > AFAIK, that's all that GP supports.
> > https://gitlab.com/openconnect/openconnect/blob/v8.02/gpst.c#L1153-1157
> >
> > After a certain time has elapsed, we tear down the TLS connection and
> > reconnect (using the same auth cookie), which also invalidates the
> > previous ESP keys and requires us to start using new ones. We should
> > handle late incoming packets using the "old" ESP keys correctly, using
> > the same method as with Juniper.
>
> We might be handling late incoming packets correctly, but we stop
> actually sending them. I wonder if we should continue to send ESP
> packets on the "old" connection even while we're doing the reconnect?

Hmmm… I don't think so. I did a whole lot of testing of the
"tap-dance" required to enable the ESP tunnel early on, and as far as
I can tell there are two main points:

1) As soon as the client config request (POST /ssl-vpn/getconfig.esp)
   is received, any pre-existing ESP keys become invalid immediately,
   and new ESP keys become valid immediately.

2) If the client ever connects to the TLS tunnel (bogus pseudo-CONNECT
   GET request to /ssl-tunnel-connect.sslvpn), the existing ESP keys
   immediately become invalid.

Given this, I believe the safe behavior is to disable the UDP
connection entirely before the reconnect starts, and just let the
outgoing packet queue grow. Currently, we're actually *not* disabling
the UDP connection before starting the reconnect
(https://gitlab.com/openconnect/openconnect/blob/v8.02/gpst.c#L1162-1171),
but we probably should be. Maybe try this patch…?

diff --git a/gpst.c b/gpst.c
index a0dc81f..5cd1aab 100644
--- a/gpst.c
+++ b/gpst.c
@@ -1160,6 +1160,8 @@ int gpst_mainloop(struct openconnect_info *vpninfo, int *timeout)
                         vpn_progress(vpninfo, PRG_ERR,
                                      _("GPST Dead Peer Detection detected dead peer!\n"));
         do_reconnect:
+                if (vpninfo->proto->udp_close)
+                        vpninfo->proto->udp_close(vpninfo);
                 ret = ssl_reconnect(vpninfo);
                 if (ret) {
                         vpn_progress(vpninfo, PRG_ERR, _("Reconnect failed\n"));

> But a reconnect/rekey would be clearly visible in OpenConnect output.
> Tony, presumably you'd have seen that and mentioned it?

Yeah, there should be a PRG_INFO message on initial connection and
reconnection: "Tunnel timeout (rekey interval) is %d minutes."
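For context on the patch: the NULL check follows from OpenConnect's
per-protocol hook table, where a protocol with no UDP transport
presumably leaves udp_close unset. A minimal sketch of that idiom,
using an illustrative struct rather than openconnect's actual
vpn_proto definition:

#include <stdio.h>

struct openconnect_info;  /* opaque in this sketch */

struct vpn_proto {
        const char *name;
        /* NULL when the protocol has no UDP/ESP transport. */
        void (*udp_close)(struct openconnect_info *vpninfo);
};

static void gpst_udp_close(struct openconnect_info *vpninfo)
{
        (void)vpninfo;
        printf("ESP/UDP transport closed; outgoing packets queue "
               "until the TLS reconnect finishes\n");
}

/* Same shape as the patched do_reconnect path: close UDP first
 * (if the protocol supports it), then reconnect TLS. */
static void do_reconnect(const struct vpn_proto *proto)
{
        if (proto->udp_close)
                proto->udp_close(NULL);
        printf("%s: reconnecting TLS mainloop...\n", proto->name);
}

int main(void)
{
        const struct vpn_proto gp       = { "gp", gpst_udp_close };
        const struct vpn_proto tls_only = { "tls-only", NULL };

        do_reconnect(&gp);
        do_reconnect(&tls_only);
        return 0;
}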
Anyway, I kind of doubt reconnect/rekey is playing a role here… all
the real GP VPNs I've heard about have rekey intervals of at least 20
minutes.

> Also, you said that you hit this a repeatable 4142 packets into a TCP
> connection? That was regardless of how long the VPN had been up?

I think Tony said it was the TCP sequence number, no? That part is
mystifying. Unless/until the gateway has successfully decrypted the
ESP packet, it should have no idea about the TCP seqno, right?

Dan

_______________________________________________
openconnect-devel mailing list
openconnect-devel@xxxxxxxxxxxxxxxxxxx
http://lists.infradead.org/mailman/listinfo/openconnect-devel
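As a closing illustration of the observation that started this
subthread: in a capture, a retransmission shows up as the same TCP
sequence number reappearing long after its neighbors. A toy sketch of
that check, with invented (time, seq) pairs standing in for values
parsed out of tcpdump output:

#include <stdio.h>
#include <stdint.h>

struct pkt { double t; uint32_t seq; };

int main(void)
{
        /* Hypothetical excerpt: seq 14481 is sent at t=0.020s, lost
         * inside the tunnel, and retransmitted at t=0.250s. */
        struct pkt pkts[] = {
                { 0.010, 13033 }, { 0.020, 14481 }, { 0.030, 15929 },
                { 0.040, 17377 }, { 0.250, 14481 },
        };
        size_t n = sizeof(pkts) / sizeof(pkts[0]);

        for (size_t i = 1; i < n; i++)
                for (size_t j = 0; j < i; j++)
                        if (pkts[i].seq == pkts[j].seq)
                                printf("seq %u seen again %.0f ms later "
                                       "-> original was lost\n",
                                       pkts[i].seq,
                                       (pkts[i].t - pkts[j].t) * 1000);
        return 0;
}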