Hi again.

Well, your intuition wasn't far off. After an uncanny number of hours testing different things, I found the solution. (I did learn that next time, tcpdump gets the first try at finding the problem ;)
It seems that the firewall in front of the customer (or perhaps his computer/application) sets the "don't fragment" (DF) bit on the packets (I believe that is fairly uncommon?). The packet arrives fine at my firewall, but as there are VLANs behind the firewall, a packet of 1500 bytes is four bytes too large to continue past the firewall. Normally, the packet would be fragmented by my firewall (as I understand it) and delivered. Unfortunately, the DF flag prevents this, so my firewall bails out with an ICMP type 3, code 4 message (unreachable - need to frag). This message seems to get lost at the firewall in front of my customer (over which he has no control). So his application waits for an acknowledgement of the packet, which my firewall dropped for being too big.
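To make the arithmetic above concrete, here is a small sketch in Python. It assumes plain IPv4 and TCP headers of 20 bytes each with no options (which matches the client's segments in the capture, but is still an assumption) and a 4-byte VLAN tag. The function names are only illustrative.

```python
# Sketch only: assumes IPv4 and TCP headers without options (20 bytes each)
# and a 4-byte 802.1Q VLAN tag eating into the 1500-byte Ethernet MTU.
IP_HEADER = 20
TCP_HEADER = 20
VLAN_TAG = 4


def packet_size(tcp_payload):
    """Size of the IP packet that carries this many bytes of TCP payload."""
    return tcp_payload + IP_HEADER + TCP_HEADER


def max_segment_size(path_mtu):
    """Largest TCP payload that fits in one unfragmented packet."""
    return path_mtu - IP_HEADER - TCP_HEADER


def fits(tcp_payload, path_mtu):
    """True if the packet can pass a hop with this MTU without fragmenting."""
    return packet_size(tcp_payload) <= path_mtu


path_mtu = 1500 - VLAN_TAG  # 1496, the value my firewall reports in the ICMP

for payload in (212, 512, 1360, 1460):
    verdict = "passes" if fits(payload, path_mtu) else "needs fragmentation"
    print(f"{payload:5d} bytes payload -> {packet_size(payload)} byte packet: {verdict}")

print("largest payload that fits:", max_segment_size(path_mtu))
```

Under these assumptions only the full 1460-byte segments exceed 1496 bytes, and the sender would have to stay at or below a segment size of 1456 to get through without fragmentation.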
The problem has been verified by setting the MTU to 1496 on his computer, after which everything runs flawlessly.
I must admit that I am not that strong on MTU discovery and ICMP messages (the above is based on what I have read today and found out with tcpdump).
Here is the relevant piece of tcpdump output (81.7.170.138 is my firewall, 81.7.185.77 is a temp IP used for debugging his server (oh, the joy of DNAT ;) and 80.63.34.250 is my client's IP):
14:26:45.691261 IP 80.63.34.250.2831 > 81.7.185.77.22: . 5460:5672(212) ack 305 win 64671
14:26:45.691333 IP 80.63.34.250.2831 > 81.7.185.77.22: P 6588:7100(512) ack 305 win 64671
14:26:45.691386 IP 81.7.185.77.22 > 80.63.34.250.2831: . ack 6588 win 12040 <nop,nop,sack sack 1 {7432:11100} >
14:26:45.691480 IP 81.7.185.77.22 > 80.63.34.250.2831: . ack 7100 win 12800 <nop,nop,sack sack 1 {7432:11100} >
14:26:45.692357 IP 80.63.34.250.2831 > 81.7.185.77.22: P 11100:12560(1460) ack 305 win 64671
14:26:45.692368 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:45.692471 IP 80.63.34.250.2831 > 81.7.185.77.22: P 12560:13920(1360) ack 305 win 64671
14:26:45.692711 IP 81.7.185.77.22 > 80.63.34.250.2831: . ack 7100 win 12800 <nop,nop,sack sack 2 {12560:13920}{7432:11100} >
14:26:45.695955 IP 80.63.34.250.2831 > 81.7.185.77.22: P 13920:15380(1460) ack 305 win 64671
14:26:45.695965 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:46.047387 IP 80.63.34.250.2831 > 81.7.185.77.22: P 7100:8560(1460) ack 305 win 64671
14:26:46.047401 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:46.812538 IP 80.63.34.250.2831 > 81.7.185.77.22: P 7100:8560(1460) ack 305 win 64671
14:26:46.812552 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:48.125581 IP 80.63.34.250.2831 > 81.7.185.77.22: P 7100:8560(1460) ack 305 win 64671
14:26:48.125592 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:50.641188 IP 80.63.34.250.2831 > 81.7.185.77.22: P 7100:8560(1460) ack 305 win 64671
14:26:50.641201 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:55.563161 IP 80.63.34.250.2831 > 81.7.185.77.22: P 7100:8560(1460) ack 305 win 64671
14:26:55.563174 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)

The customer offered his NOC my number so that I could tell them what I found out, but he declined, saying "my network is running fine!".
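For anyone who wants to check the pattern in the capture mechanically, the following Python sketch pairs each client segment with an ICMP "need to frag" that immediately follows it. The sample lines are copied from the capture above. The regular expressions and the `classify` helper are my own illustration and only assume the old tcpdump output format shown here.

```python
import re

# A few lines copied verbatim from the capture above.
CAPTURE = """\
14:26:45.691333 IP 80.63.34.250.2831 > 81.7.185.77.22: P 6588:7100(512) ack 305 win 64671
14:26:45.692357 IP 80.63.34.250.2831 > 81.7.185.77.22: P 11100:12560(1460) ack 305 win 64671
14:26:45.692368 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
14:26:45.692471 IP 80.63.34.250.2831 > 81.7.185.77.22: P 12560:13920(1360) ack 305 win 64671
14:26:46.047387 IP 80.63.34.250.2831 > 81.7.185.77.22: P 7100:8560(1460) ack 305 win 64671
14:26:46.047401 IP 81.7.170.138 > 80.63.34.250: icmp 556: 81.7.185.77 unreachable - need to frag (mtu 1496)
"""

# Data segment from the client to the server: "flags first:last(length)".
SEGMENT = re.compile(r"> 81\.7\.185\.77\.22: \S+ (\d+):(\d+)\((\d+)\)")
# ICMP type 3 code 4 as printed by tcpdump.
NEED_FRAG = re.compile(r"need to frag \(mtu (\d+)\)")


def classify(capture):
    """Return (rejected, passed) lists of TCP payload lengths.

    A segment counts as rejected when the next relevant line is an
    ICMP "need to frag"; otherwise it counts as passed.
    """
    rejected, passed = [], []
    pending = None  # payload length of the last segment not yet classified
    for line in capture.splitlines():
        seg = SEGMENT.search(line)
        if seg:
            if pending is not None:
                passed.append(pending)
            pending = int(seg.group(3))
        elif NEED_FRAG.search(line) and pending is not None:
            rejected.append(pending)
            pending = None
    if pending is not None:
        passed.append(pending)
    return rejected, passed


rejected, passed = classify(CAPTURE)
print("rejected payload sizes:", rejected)
print("passed payload sizes:  ", passed)
```

On these sample lines, only the 1460-byte segments are followed by the ICMP message, while the 512- and 1360-byte segments go through, which is consistent with the 1496-byte limit.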
So I am very interested in knowing:
- whether or not you agree with me that this is a problem with the firewall in front of the customer (as opposed to a flaw in my setup)?
- whether this could be a potential problem for other people (my MTU of 1496 instead of 1500)?
- if it is his firewall, can I still use the TCPMSS extension to correct this problem, and if so, how?
Thanks in advance
Svenne

Derick Anderson wrote:
Inline.

-----Original Message-----
From: netfilter-bounces@xxxxxxxxxxxxxxxxxxx [mailto:netfilter-bounces@xxxxxxxxxxxxxxxxxxx] On Behalf Of Svenne Krap
Sent: Saturday, December 10, 2005 8:23 AM
To: netfilter@xxxxxxxxxxxxxxxxxxx
Subject: Very wierd problem

Hi. I have quite a problem.

One of my customers is suddenly unable to upload data to his machine (neither via SFTP/SCP nor regular FTP nor HTTP) behind my firewall. I believe it is due to something changed on the network he is connected to (as I have not changed anything during that period). He has no problems downloading data, but when uploading, the upload stalls after 4 kB of transfer. What is even worse, I cannot recreate the problem from anywhere I have tried (>5 different ISPs).

I interpret this to mean that the problem is with a particular customer on a particular ISP - your other customers can upload just fine. The fact that data can be downloaded (but not uploaded) is very strange.

[snip]

This has worked flawlessly for half a year or so. But suddenly it stopped working. The customer's upstream provider blames my firewall. An interesting point is that the customer CAN upload to the firewall itself by scp through its /29 address (it has no /26). But as said, I have not changed anything in the way the firewall works around the time the problem arose, and any attempt to recreate it has been a failure.

Of course they blame your firewall. Did they give a reason? I assume when you attempt to recreate the problem you are uploading to the customer's server on your network and it works fine, and that your other customers are not having problems with similar rulesets. If you haven't changed anything, I would recommend not messing with your firewall. My first urge when there is a problem is to check everything and start changing things which "don't look right". However, if there is a system which is the same now as it was when things were working, then I stifle that urge and look elsewhere. Maybe there's something with the client's internal server or settings on the VLAN switch... Verify your settings but don't go changing them without a good reason. If you do find something odd, change one thing at a time and then test - otherwise you'll never know exactly what the problem was.

I have tried to log packets both in the filter tables and the prerouting chain of the nat table (before doing the NAT). But nothing really catches my eye.

Any suggestions as to what could be the problem? Or how I could zero in on it? What to log and so on?

I am not really keen on publishing the firewall script, but I will send it to helpful individuals by email on request.

Thanks in advance
Svenne

I would set up ethereal on your firewall and monitor both sides of a transfer from this client. See who sends the last packet before the connection is dropped. Since this problem appears to be protocol independent, I would pay close attention to the TCP connections, but I would also be curious about HTTP and FTP since they are cleartext and may have some additional information about what might be going on (timeouts, etc.).

Derick Anderson