Hi, Sorry, I pressed the send button by mistake ... We are having strange Squid troubles, at first, let me describe our setup: - 4 HP G6/G7 DL380 servers with 16CPUs and 28GB RAM with RHEL 5.4-5.8 64bit and Squid 3.1.12 (custom compiled) Squid Cache: Version 3.1.12 configure options: '--enable-ssl' '--enable-icap-client' '--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp' '--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs' '--enable-removal-policies=heap,lru' '--enable-epoll' '--disable-ident-lookups' '--enable-truncate' '--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid' '--with-default-user=squid' '--prefix=/opt/squid' '--enable-auth=basic digest ntlm negotiate' '-enable-negotiate-auth-helpers=squid_kerb_auth' --with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience - Each server has two instances for kerberos/ntlm authentication and two instances for LDAP authentication (different customers) - we have a hardware loadbalancer which is balancing request for our kerberos-customers (4x2 instances) and ldap-customers (4x2 instances), each has a different IP address. - average load values are approx 0.5 (5min values) - approx 60RPS per instance (equally distributed -> 16 * 60 => 960 RPS) - up to 150Mbit/s traffic per server - ICAP servers for content adaption (multiple servers with a hardware loadbalancer in front of it) >From time to time we are having troubles with our Squid servers which may not be a problem related to Squid, I suspect an OS issue. Nevertheless, sometimes the servers don't respond to request (even SSH-requests) or logging in takes forever (reverse lookup failure?) or even worse, sometimes the server interface is just down (there is no indication of any problem at the switch port level). If we check the squidclient output, we can see some hanging ldap authenticators: squid@xlsqit01 /opt/squid/bin $ ./squidclient -h 10.122.125.23 cache_object://10.122.125.23/basicauthenticator HTTP/1.0 200 OK Server: squid/3.1.12 Mime-Version: 1.0 Date: Tue, 13 Mar 2012 13:34:07 GMT Content-Type: text/plain Expires: Tue, 13 Mar 2012 13:34:07 GMT Last-Modified: Tue, 13 Mar 2012 13:34:07 GMT X-Cache: MISS from xlsqip02_3 Via: 1.0 xlsqip02_3 (squid/3.1.12) Connection: close Basic Authenticator Statistics: program: /opt/squid/libexec/squid_ldap_auth number active: 20 of 20 (0 shutting down) requests sent: 13316 replies received: 13312 queue length: 0 avg service time: 4741 msec # FD PID # Requests Flags Time Offset Request 1 12 16038 2150 B 125.885 0 user1 pw1\n 2 24 16043 85 B 119.562 0 user2 pw2\n 3 32 16049 63 B 13.639 0 user3 pw3\n 4 43 16055 21 B 116.143 0 user4 pw4\n 5 46 16059 12 189.002 0 (none) 6 50 16064 1 189.003 0 (none) 7 56 16069 2 0.079 0 (none) 8 60 16074 0 0.000 0 (none) 9 65 16079 0 0.000 0 (none) 10 86 16084 0 0.000 0 (none) 11 88 16095 0 0.000 0 (none) 12 90 16101 0 0.000 0 (none) 13 92 16117 0 0.000 0 (none) 14 95 16122 0 0.000 0 (none) 15 97 16130 0 0.000 0 (none) 16 99 16138 0 0.000 0 (none) 17 101 16144 0 0.000 0 (none) 18 104 16150 0 0.000 0 (none) 19 107 16162 0 0.000 0 (none) 20 109 16173 0 0.000 0 (none) Flags key: B = BUSY W = WRITING C = CLOSING S = SHUTDOWN PENDING 2012/03/13 03:00:04| Ready to serve requests. squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server' Testing the ldap authentication at CLI level, it is working without any problems: root@xlsqip02 ~ # /opt/squid/libexec/squid_ldap_auth -b "dc=squid-proxy" -D "uid=...." -w xxx -h ldaphost -f "(uid=%s)" user1 pw1 OK Unfortunately, there is nothing helpful in syslog, e.g. Mar 13 15:05:19 xlsqip02 last message repeated 2 times Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0] libsmb/clientgen.c:cli_receive_smb(111) Mar 13 15:05:25 xlsqip02 winbindd[4283]: Receiving SMB: Server stopped responding Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0] rpc_client/cli_pipe.c:rpc_api_pipe(790) Mar 13 15:05:25 xlsqip02 winbindd[4283]: rpc_api_pipe: Remote machine wienroot1.wien.rbgat.net pipe \lsarpc fnum 0x4008returned critical error. Error was Call timed out: server did not respond after 10000 milliseconds Mar 13 15:05:48 xlsqip02 sockd[4235]: warning: accept(2) failed: Resource temporarily unavailable (errno = 11) Mar 13 15:06:20 xlsqip02 last message repeated 7 times Mar 13 15:07:26 xlsqip02 last message repeated 4 times Mar 13 15:08:27 xlsqip02 last message repeated 4 times Mar 13 15:09:30 xlsqip02 last message repeated 10 times Mar 13 15:10:37 xlsqip02 last message repeated 7 times Mar 13 15:11:39 xlsqip02 last message repeated 11 times Mar 13 15:12:55 xlsqip02 last message repeated 9 times Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0] libsmb/credentials.c:creds_client_check(324) Mar 13 15:12:57 xlsqip02 winbindd[4331]: creds_client_check: credentials check failed. Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0] rpc_client/cli_netlogon.c:rpccli_netlogon_sam_network_logon(1030) Mar 13 15:12:57 xlsqip02 winbindd[4331]: rpccli_netlogon_sam_network_logon: credentials chain check failed Mar 13 15:13:05 xlsqip02 sockd[4235]: warning: accept(2) failed: Resource temporarily unavailable (errno = 11) btw, winbind just sucks ... But I doubt that winbind is the root cause ... Anyway, we had some NIC issues before (packet drops), at the moment we disabled all TSO-stuff root@xlsqip02 ~ # ethtool -k eth0 Offload parameters for eth0: Cannot get device udp large send offload settings: Operation not supported rx-checksumming: off tx-checksumming: off scatter-gather: off tcp segmentation offload: off udp fragmentation offload: off generic segmentation offload: off generic-receive-offload: off root@xlsqip02 ~ # ethtool -i eth0 driver: bnx2 version: 1.9.3 firmware-version: 4.6.4 NCSI 1.0.3 bus-info: 0000:02:00.0 root@xlsqip02 ~ # ethtool -g eth0 Ring parameters for eth0: Pre-set maximums: RX: 1020 RX Mini: 0 RX Jumbo: 4080 TX: 255 Current hardware settings: RX: 1020 RX Mini: 0 RX Jumbo: 0 TX: 255 netstat output, if interesting: root@xlsqip02 ~ # netstat -s Ip: 1031106057 total packets received 32 with invalid addresses 0 forwarded 0 incoming packets discarded 1031105815 incoming packets delivered 943692708 requests sent out 214 dropped because of missing route 34 reassemblies required 17 packets reassembled ok Icmp: 77877 ICMP messages received 339 input ICMP message failed. ICMP input histogram: destination unreachable: 31124 timeout in transit: 3011 echo requests: 43271 echo replies: 467 43804 ICMP messages sent 0 ICMP messages failed ICMP output histogram: destination unreachable: 66 echo request: 467 echo replies: 43271 IcmpMsg: InType0: 467 InType3: 31124 InType8: 43271 InType11: 3011 OutType0: 43271 OutType3: 66 OutType8: 467 Tcp: 12276761 active connections openings 15459221 passive connection openings 10856 failed connection attempts 875610 connection resets received 899 connections established 1027715456 segments received 937603201 segments send out 3241669 segments retransmited 4510 bad segments received. 1712333 resets sent Udp: 2017646 packets received 13 packets to unknown port received. 0 packet receive errors 2804237 packets sent TcpExt: 1717 resets received for embryonic SYN_RECV sockets 4122 packets pruned from receive queue because of socket buffer overrun 2794 ICMP packets dropped because they were out-of-window 10477659 TCP sockets finished time wait in fast timer 105 time wait sockets recycled by time stamp 15872733 delayed acks sent 50531 delayed acks further delayed because of locked socket Quick ack mode was activated 5087770 times 742212 packets directly queued to recvmsg prequeue. 4525346 packets directly received from backlog 88460270 packets directly received from prequeue 444903256 packets header predicted 286109 packets header predicted and directly queued to user 139821827 acknowledgments not containing data received 291299405 predicted acknowledgments 277775 times recovered from packet loss due to fast retransmit 1646 times recovered from packet loss due to SACK data 5 bad SACKs received Detected reordering 82 times using FACK Detected reordering 62 times using SACK Detected reordering 1489 times using reno fast retransmit Detected reordering 12950 times using time stamp 13026 congestion windows fully recovered 83165 congestion windows partially recovered using Hoe heuristic TCPDSACKUndo: 5989 18959 congestion windows recovered after partial ack 3362 TCP data loss events TCPLostRetransmit: 1 57745 timeouts after reno fast retransmit 1198 timeouts after SACK recovery 31172 timeouts in loss state 473117 fast retransmits 1478 forward retransmits 1445453 retransmits in slow start 888214 other TCP timeouts TCPRenoRecoveryFail: 134832 27 sack retransmits failed 589052 packets collapsed in receive queue due to low socket buffer 4622235 DSACKs sent for old packets 9737 DSACKs sent for out of order packets 15792 DSACKs received 1 DSACKs for out of order packets received 421669 connections reset due to unexpected data 2977 connections reset due to early user close 11177 connections aborted due to timeout IpExt: InBcastPkts: 1294824 OutBcastPkts: 648179 root@xlsqip02 ~ # uptime 15:43:02 up 6 days, 54 min, 2 users, load average: 0.12, 0.14, 0.16 Usually, if we get high squid authenticators response time, there is an issue, e.g.: http://desmond.imageshack.us/Himg840/scaled.php?server=840&filename=squid3ldapauthenticator.png&res=medium sysctl-values: http://pastie.org/3586014 Sometime restarting the network helps or rebooting the server, but during the last couple of days, these issue occur way too often. Has anybody any idea about what else we could check or experienced anything similar? Thanks! regards Peter On Tue, Mar 13, 2012 at 3:07 PM, guest01 <guest01@xxxxxxxxx> wrote: > Hi, > > We are having strange Sq