On 02.04.2014 16:41, David Woodhouse wrote:
> On Wed, 2014-04-02 at 15:46 +0200, Ortwin Glück wrote:
>> $ strace -ttfp 29785
>
> After manually editing that to undo the word-wrapping and make it
> readable again, it looks like the two large delays you see in select()
> are when it's waiting for data from the tunnel. I'm not sure it can make
> that happen faster. The problem here isn't the way the event loop works,
> AFAICT.

Right, the problem is how packets come in through tun via select/read.
I have just noticed that this is a regression in kernel 3.14! It works
fine in 3.12 and 3.13. Sorry to bother you. I will bisect and talk to
the kernel folks. The rest of my answers follow, for completeness.

>> The question remains: why does openconnect's select block for so long?
>> A lot of data should already be available in the socket buffer.
>
> It's not the socket buffer we care about, from the openconnect point of
> view. It's packets being "sent" on the tunnel interface. And openconnect
> is sending those as soon as they get here...

Yes, my wording was wrong. Call it the send queue.

>> Here you go. I have replaced the actual IPs with names for convenience
>> and privacy. Timestamps are seconds since the previous packet. Note the
>> 0.214s delays before local starts sending again after the last ACK.
>
> Hm, I was hoping for timestamps which would correlate with the ones in
> the strace run, but I can cope with this...

These were not run at the same time anyway.

>> No.   Time   Source     Destination  Protocol  Length  Info
>> 1100  0.214  tun-local  tun-remote   SSHv2     1280    Encrypted request packet len=1228
>> 1101  0.000  SSL-local  SSL-remote   TLSv1     1126    Application Data
>> 1102  0.000  SSL-local  SSL-remote   TLSv1      328    Application Data
>
> A packet is 'sent' over the tun interface. OpenConnect immediately sends
> it. It takes *two* packets to send it, since your MTU inside the tunnel
> is larger than your MTU outside the tunnel.

No matter what value I pass to --mtu, I always end up with 2 packets -
only their sizes change. What do you suggest? Anyhow, that is certainly
not the reason for the performance issue at hand.

> Why aren't you using DTLS?

Apr  2 16:47:08 localhost openconnect[3195]: DTLS handshake failed: 1
Apr  2 16:47:08 localhost openconnect[3195]: 140213524236008:error:14092073:SSL routines:SSL3_GET_SERVER_HELLO:bad packet length:s3_clnt.c:1063:

Reason: I am using OpenSSL (the required GnuTLS version is too new for
the rest of the system). Switching to GnuTLS fixes the DTLS handshake,
but the performance is the same (the select() call is unaffected by it).

>> 1107  0.214  tun-local  tun-remote   SSHv2     1280    Encrypted request packet len=1228
>
> There is a 214ms delay before the next packet is sent by SSH (or, more
> to the point, the TCP stack). The sequence repeats...

Yes, that's what I am saying.

>> 1108  0.000  SSL-local   SSL-remote  TLSv1     1126    Application Data
>> 1109  0.000  SSL-local   SSL-remote  TLSv1      328    Application Data
>> 1110  0.007  SSL-remote  SSL-local   TCP         60    443 > 54547 [ACK] Seq=19851 Ack=205687 Win=32768 Len=0
>> 1111  0.001  SSL-remote  SSL-local   TLSv1      147    Application Data
>> 1112  0.000  SSL-local   SSL-remote  TCP         54    54547 > 443 [ACK] Seq=205961 Ack=19944 Win=65535 Len=0
>> 1113  0.000  tun-remote  tun-local   TCP         52    22 > 37376 [ACK] Seq=6189 Ack=186064 Win=64128 Len=0
>> 1114  0.214  tun-local   tun-remote  SSHv2     1280    Encrypted request packet len=1228
>
> Why is it taking 7ms for a pair of TCP packets to reach your VPN server
> and then for an ACK to come back for them?

Interesting question. Ping to the tunnel endpoint is around 0.06 ms
(100 times less).
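A quick way to put numbers on that (both the round-trip time TCP sees
and how much it actually has in flight) is TCP_INFO. A minimal sketch,
assuming a connected TCP socket fd in your own process (so it cannot be
pointed at ssh's socket from outside; it only illustrates what to look
at) - Linux-specific, and not openconnect code:

/* Sketch: dump RTT and congestion-window state for a connected
 * TCP socket we own.  Hypothetical diagnostic, Linux-only. */
#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>        /* struct tcp_info, TCP_INFO */
#include <sys/socket.h>

static void dump_tcp_info(int fd)
{
        struct tcp_info ti;
        socklen_t len = sizeof(ti);

        if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) < 0) {
                perror("getsockopt(TCP_INFO)");
                return;
        }
        /* tcpi_rtt/tcpi_rttvar are in microseconds;
         * tcpi_snd_cwnd and tcpi_unacked are in segments. */
        printf("rtt=%uus rttvar=%uus cwnd=%u unacked=%u retrans=%u\n",
               ti.tcpi_rtt, ti.tcpi_rttvar, ti.tcpi_snd_cwnd,
               ti.tcpi_unacked, ti.tcpi_total_retrans);
}

If the stack really thought it was congestion-limited you would expect
a small cwnd here; a large cwnd with almost nothing unacked would point
back at the sender simply not being woken up.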
> I can't see the sequence numbers for your outgoing packets but I have a
> strong suspicion that there's a fair amount of data in-flight at any
> given time.

No, there are only ever around 4000 bytes in flight (looking at the SSL
connection), and only ever 1354 bytes - one packet! - looking at the
tunnel interface. That's what I am saying: it doesn't saturate the
bandwidth at all! You can see that from the window sizes in the ACKs:
32768 from SSL-remote and 64128 from tun-remote (higher?!?). It's a
complete ping-pong of ACK/data/ACK/data without ever filling the TCP
window. As if it were waiting for something.

> So it's not that the SSH connection sends a data packet,
> gets an ACK within 7ms and then gratuitously waits around for almost a
> quarter of a second before sending any more data. I suspect it's got
> packets outstanding and the TCP stack has determined that this is the
> maximum rate it can send at.
>
> What's the TCP sendq on the offending SSH socket (and the SSL socket for
> that matter, but I suspect that'll be low) when this happens?
> (netstat -tn)

SSH connection:
tcp        0  29788  tun-local:46577   tun-remote:22    ESTABLISHED

SSL VPN connection:
tcp        0      0  SSL-local:55422   SSL-remote:443   ESTABLISHED

Again, it's what I have been stating: the sendq of the SSH socket is
always full (and the VPN side is completely empty, of course). But
select() still takes >200ms to get a signal...
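For the bisect, it may help to make the stall directly measurable. Here
is a minimal sketch of the same select()-on-tun pattern with timing
around it - assuming tun_fd is an already-configured tun device fd; it
illustrates the pattern under discussion, not openconnect's actual
event loop:

/* Sketch: measure how long select() blocks before the tun fd
 * becomes readable.  On a loaded tunnel this should be near zero;
 * the ~214ms stalls seen here would show up as large values. */
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

static void time_tun_reads(int tun_fd)
{
        char buf[2048];

        for (;;) {
                fd_set rfds;
                struct timeval t0, t1;

                FD_ZERO(&rfds);
                FD_SET(tun_fd, &rfds);

                gettimeofday(&t0, NULL);
                if (select(tun_fd + 1, &rfds, NULL, NULL, NULL) < 0) {
                        perror("select");
                        break;
                }
                gettimeofday(&t1, NULL);

                long us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                          (t1.tv_usec - t0.tv_usec);
                ssize_t n = read(tun_fd, buf, sizeof(buf));

                fprintf(stderr, "select blocked %ld us, read %zd bytes\n",
                        us, n);
        }
}

Running that on 3.13 vs 3.14 at each bisect step should make good/bad
easy to call without eyeballing strace output.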