Re: Exchange 2010 and 502 Bad Gateway

On 23/08/2013 8:18 p.m., Bill Houle wrote:
For the next in my continuing Exchange saga, let's talk 502 errors. I've got a couple different instances.

1) ActiveSync sends periodic 'Ping' requests to implement its "server push" feature. If I understand the process correctly, the client sends an empty (Content-Length: 0) keep-alive HTTP request and tries to see how long the server+network honor the session.

Potential problem #1: what type of keep-alive request? The old HTTP/1.0 "Keep-Alive:" header is deprecated, not supported by Squid, and does not actually work in most places anyway. Simply opening a TCP connection and waiting after the first Ping request until it closes is a terrible way to test this.


It uses a back-off algorithm to eventually settle on a timing value that it knows the network can support: if the keep-alive expires cleanly, they up the ante and repeat; if the HTTP session aborts, they drop it down to the previous success and lock in the refresh rate. From that point forward, they've got a sync window and continue to issue Pings at that duration. That way, if the Ping aborts, it is a signal that a 'Sync' is needed because "server push" has new data.
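In rough pseudocode, that settling behaviour amounts to something like the sketch below (a sketch only: send_ping() is a hypothetical helper that issues one Cmd=Ping request and reports whether the connection survived the requested interval, and the interval values are illustrative rather than Microsoft's actual numbers):

    def settle_heartbeat(send_ping, start=60, step=60, ceiling=1800):
        # Find the longest Ping interval (seconds) the path will hold.
        interval = start
        last_good = None
        while interval <= ceiling:
            if send_ping(interval):
                last_good = interval      # clean expiry: up the ante and repeat
                interval += step
            else:
                break                     # session aborted: fall back to the
                                          # previous success and lock it in
        return last_good                  # the locked-in sync window (None if
                                          # even the first interval failed)

If the aborts come at unpredictable points, last_good never stabilises between runs, which is exactly the variable polling effect described further on.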

Potential problem #2: are they using HTTP/1.1 1xx status codes from the server as this sync ping, or HTTP/1.0 simple request/reply pairs? Squid older than 3.2 does not support the 1xx status response. So is there any HTTP/1.0 software along the network path (including Squid up to version 3.1)?

What I'm actually seeing is that the system is never able to settle on a consistent keep-alive sync window the way MS might like. The Ping, or string of Pings, might last for minutes or only seconds. When the Ping ultimately fails, the system does a Sync even though there may be nothing new. The end result is less like "server push" and more like polling at a variable rate.

This is where we come back to the whole design being a terrible way to operate. They are trying to measure the mismatched TCP socket timeouts on every box along the pathway, the NAT record timeouts on every NAT relay along the pathway, and the idle connection timeouts on every proxy along the pathway. Simultaneously.


The users don't really notice or care since they still get their updates promptly. It's hardly catastrophic for me, but I could envision the variable-polling behavior becoming slightly more taxing as the number of users scales upward. But I'm curious whether there's any Squid debug I can add that might reveal why the session durations vary so much. At debug level 11,2, the only thing I see is:

2013/08/19 00:46:51 kid1| WARNING: HTTP: Invalid Response: No object data received for https://mail.domain.com/Microsoft-Server-ActiveSync?User=user&DeviceId=ApplF4KKR4GLF199&DeviceType=iPad&Cmd=Ping AKA mail.domain.com/Microsoft-Server-ActiveSync?User=user&DeviceId=ApplF4KKR4GLF199&DeviceType=iPad&Cmd=Ping

To which Squid replies back to the client with 502 Bad Gateway. X-Squid-Error is ERR_ZERO_SIZE_OBJECT.

It will be more taxing as the number of users increases. These connections are long-term, blocked from any other use by the client end, and reserve 2 TCP sockets and 1 disk FD on the proxy for every connection.

No, there is no easy way to debug why the connection lengths vary. You need Wireshark or a similar tool with a packet trace to identify where the close is coming from. That Squid message indicates that something between Squid and the server is cutting the connection.
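If a full capture is not immediately practical, a crude idle-timer probe can at least hint at whose timeout fires first. The sketch below only measures how long a bare, idle TCP connection stays open (no TLS, no HTTP), and the host/port are placeholders, so treat the result as a rough hint rather than proof:

    import socket
    import time

    def idle_lifetime(host, port, max_wait=3600):
        # Open a TCP connection, send nothing, and time how long it takes
        # for the far end (or anything in between) to close it.
        s = socket.create_connection((host, port), timeout=max_wait)
        started = time.time()
        try:
            s.recv(1)                     # blocks; returns b'' on an orderly close
        except socket.timeout:
            return None                   # still open after max_wait seconds
        except ConnectionResetError:
            pass                          # an RST counts as a close too
        finally:
            s.close()
        return time.time() - started

    if __name__ == "__main__":
        print(idle_lifetime("mail.domain.com", 443))   # placeholder host/port

Run it from the Squid box against the Exchange server; if the idle connection dies much sooner than the Ping sessions do, something on that leg (firewall, NAT, or the server itself) is the likely culprit, and that is where to point the packet trace.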


2) The next problem is OWA (WebMail). OWA is designed to mimic Outlook, so if Outlook can support 10 MB attachments, so can OWA. A user tries to send a large attachment. Unlike the ActiveSync problem I previously posted about, UploadReadAhead does not seem to enter into the equation, possibly because the POST is redirected to an /EWS/ proxy. It happily chunks well past the ActiveSync threshold, but at some point the connection may still fail:

2013/08/21 07:41:07.616 kid1| http.cc(1172) readReply: local=proxy.IP:42891 remote=Exchange.IP:443 FD 39 flags=1: read failure: (32) Broken pipe.

To which Squid replies back to the client with 502 Bad Gateway. X-Squid-Error is ERR_READ_ERROR 104.

I know Squid doesn't touch the data, and thus doesn't care about transaction size. But is there anything more I can do to minimize all possible drops & connection timeouts, particularly with large POSTs? I'm not saying the drops are Squid's fault; I just want to idiot-proof the setup on this end as much as possible.

This sounds like a bug in Exchange itself. The HTTP protocol offers chunked encoding to get around this type of error, and Squid will send it whenever necessary and possible. But that relies on the other end working correctly; there is nothing that can be done about a POST if the server is broken.
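For reference, chunked encoding (RFC 2616 section 3.6.1) simply prefixes each piece of the body with its size in hex and terminates with a zero-length chunk, so neither end needs to know the total size up front. A minimal illustration of the wire format follows (not Squid's code; the chunk size is arbitrary):

    def chunked(body, chunk_size=8192):
        # Encode a byte string as an HTTP/1.1 chunked message body.
        out = b""
        for i in range(0, len(body), chunk_size):
            piece = body[i:i + chunk_size]
            out += b"%x\r\n" % len(piece) + piece + b"\r\n"
        return out + b"0\r\n\r\n"         # zero-size chunk terminates the body

    print(chunked(b"large attachment data", chunk_size=8).decode())

If the receiving end mishandles that framing partway through an upload, the sending side just sees the connection drop.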

3) Final example is RPC-over-HTTPS. I routinely see 502s on "connection reset by peer" (RSTs seem to be par for the course on Windows systems). But I've also seen ERR_READ_ERROR 104 on a "No error" error.

2013/08/19 21:09:37.239 kid1| http.cc(1172) readReply: local=proxy.IP:58798 remote=Exchange.IP:443 FD 44 flags=1: read failure: (0) No error..

What could this possibly indicate?

Strange, but not unheard of. Something in the asynchronous event handling overwrote the global error detail before Squid could pick it up.

Amos




