On 14.03.2012 03:54, guest01 wrote:
Hi,
Sorry, I pressed the send button by mistake ...
We are having strange Squid troubles. First, let me describe our
setup:
- 4 HP G6/G7 DL380 servers with 16CPUs and 28GB RAM with RHEL 5.4-5.8
64bit and Squid 3.1.12 (custom compiled)
Squid Cache: Version 3.1.12
configure options: '--enable-ssl' '--enable-icap-client'
'--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp'
'--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs'
'--enable-removal-policies=heap,lru' '--enable-epoll'
'--disable-ident-lookups' '--enable-truncate'
'--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid'
'--with-default-user=squid' '--prefix=/opt/squid'
'--enable-auth=basic digest ntlm negotiate'
'-enable-negotiate-auth-helpers=squid_kerb_auth'
--with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience
- Each server runs two instances for Kerberos/NTLM authentication and
two instances for LDAP authentication (different customers)
- We have a hardware load balancer which balances requests for our
Kerberos customers (4x2 instances) and LDAP customers (4x2 instances),
each with a different IP address.
- average load values are approx 0.5 (5min values)
- approx 60 RPS per instance (equally distributed -> 16 * 60 => 960 RPS)
- up to 150Mbit/s traffic per server
- ICAP servers for content adaptation (multiple servers with a hardware
load balancer in front of them)
From time to time we have trouble with our Squid servers. It may not
be related to Squid; I suspect an OS issue. Nevertheless, sometimes
the servers don't respond to requests (even SSH requests), or logging
in takes forever (reverse lookup failure?), or, even worse, the server
interface is just down (with no indication of any problem at the
switch port level). If we check the squidclient output, we can see
some hanging LDAP authenticators:
squid@xlsqit01 /opt/squid/bin $ ./squidclient -h 10.122.125.23 cache_object://10.122.125.23/basicauthenticator
HTTP/1.0 200 OK
Server: squid/3.1.12
Mime-Version: 1.0
Date: Tue, 13 Mar 2012 13:34:07 GMT
Content-Type: text/plain
Expires: Tue, 13 Mar 2012 13:34:07 GMT
Last-Modified: Tue, 13 Mar 2012 13:34:07 GMT
X-Cache: MISS from xlsqip02_3
Via: 1.0 xlsqip02_3 (squid/3.1.12)
Connection: close
Basic Authenticator Statistics:
program: /opt/squid/libexec/squid_ldap_auth
number active: 20 of 20 (0 shutting down)
requests sent: 13316
replies received: 13312
queue length: 0
avg service time: 4741 msec
  #   FD    PID  # Requests  Flags     Time  Offset  Request
  1   12  16038        2150  B      125.885       0  user1 pw1\n
  2   24  16043          85  B      119.562       0  user2 pw2\n
  3   32  16049          63  B       13.639       0  user3 pw3\n
  4   43  16055          21  B      116.143       0  user4 pw4\n
  5   46  16059          12         189.002       0  (none)
  6   50  16064           1         189.003       0  (none)
  7   56  16069           2           0.079       0  (none)
  8   60  16074           0           0.000       0  (none)
  9   65  16079           0           0.000       0  (none)
 10   86  16084           0           0.000       0  (none)
 11   88  16095           0           0.000       0  (none)
 12   90  16101           0           0.000       0  (none)
 13   92  16117           0           0.000       0  (none)
 14   95  16122           0           0.000       0  (none)
 15   97  16130           0           0.000       0  (none)
 16   99  16138           0           0.000       0  (none)
 17  101  16144           0           0.000       0  (none)
 18  104  16150           0           0.000       0  (none)
 19  107  16162           0           0.000       0  (none)
 20  109  16173           0           0.000       0  (none)
Looks like you can save some resources by dropping that down to 10
helpers. But re-evaluate that once they are fixed, in case the load
goes up afterwards.
Flags key:
B = BUSY
W = WRITING
C = CLOSING
S = SHUTDOWN PENDING
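If you want to watch for this automatically, the helper table can be picked apart with a few lines of Python. This is only a rough sketch, not anything that ships with Squid: the names `parse_helpers` and `stuck`, the 60 second threshold, and the reading of the Time column as seconds are my assumptions, and the sample rows are copied from the output above.

```python
import re
from typing import List, NamedTuple


class Helper(NamedTuple):
    index: int
    fd: int
    pid: int
    requests: int
    flags: str
    time_s: float
    offset: int
    request: str


# One row: "# FD PID Requests [Flags] Time Offset Request".
# The Flags column is empty for idle helpers, so it is optional here.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(?:([BWCS]+)\s+)?"
    r"(\d+\.\d+)\s+(\d+)\s+(.*)$"
)


def parse_helpers(text: str) -> List[Helper]:
    """Collect every line that looks like a helper row; ignore the rest."""
    helpers = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            i, fd, pid, req, flags, t, off, rest = m.groups()
            helpers.append(Helper(int(i), int(fd), int(pid), int(req),
                                  flags or "", float(t), int(off),
                                  rest.strip()))
    return helpers


def stuck(helpers: List[Helper], threshold_s: float = 60.0) -> List[Helper]:
    """Helpers flagged BUSY whose Time value is at or above the threshold
    (threshold is an arbitrary assumption, tune it to your site)."""
    return [h for h in helpers if "B" in h.flags and h.time_s >= threshold_s]


# Sample rows taken from the cache manager output quoted above.
SAMPLE = r"""
1 12 16038 2150 B 125.885 0 user1 pw1\n
2 24 16043 85 B 119.562 0 user2 pw2\n
3 32 16049 63 B 13.639 0 user3 pw3\n
4 43 16055 21 B 116.143 0 user4 pw4\n
5 46 16059 12 189.002 0 (none)
8 60 16074 0 0.000 0 (none)
"""

if __name__ == "__main__":
    for h in stuck(parse_helpers(SAMPLE)):
        print(f"helper {h.index} (pid {h.pid}) busy for {h.time_s:.0f}s")
```

Feeding it the squidclient output from a cron job would at least tell you when the helpers start hanging, which you can then line up against the syslog timestamps.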
2012/03/13 03:00:04| Ready to serve requests.
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
Testing the LDAP authentication on the command line works without any
problems:
root@xlsqip02 ~ # /opt/squid/libexec/squid_ldap_auth -b "dc=squid-proxy" -D "uid=...." -w xxx -h ldaphost -f "(uid=%s)"
user1 pw1
OK
Unfortunately, there is nothing helpful in syslog, e.g.
Mar 13 15:05:19 xlsqip02 last message repeated 2 times
Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0] libsmb/clientgen.c:cli_receive_smb(111)
Mar 13 15:05:25 xlsqip02 winbindd[4283]: Receiving SMB: Server stopped responding
Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0] rpc_client/cli_pipe.c:rpc_api_pipe(790)
Mar 13 15:05:25 xlsqip02 winbindd[4283]: rpc_api_pipe: Remote machine wienroot1.wien.rbgat.net pipe \lsarpc fnum 0x4008returned critical error. Error was Call timed out: server did not respond after 10000 milliseconds
What does the domain "wienroot1.wien.rbgat.net" resolve to?
Is connectivity to all its IPs working?
Looks a lot like network congestion affecting SMB. Or possibly route
up/down connectivity issues for IP (v4? v6?).
Winbind has some nasty limitations, but should not be hitting this type
of problem.
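A quick way to answer both questions from the proxy box itself is something like the following. Treat it as a minimal sketch only: the function names are made up for illustration, and port 445 (SMB over TCP) and the 3 second timeout are my assumptions about what is worth probing, not something taken from your winbind configuration.

```python
import socket
from typing import Dict, List, Tuple


def resolve_all(host: str, port: int) -> List[Tuple[int, tuple]]:
    """Return every distinct (address family, sockaddr) the name resolves to,
    covering both IPv4 and IPv6 results."""
    seen: List[Tuple[int, tuple]] = []
    for family, _type, _proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        entry = (family, sockaddr)
        if entry not in seen:
            seen.append(entry)
    return seen


def can_connect(family: int, sockaddr: tuple, timeout: float = 3.0) -> bool:
    """Try a plain TCP connect to one address; True if it succeeds in time."""
    s = socket.socket(family, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(sockaddr)
        return True
    except OSError:  # refused, unreachable, timed out, ...
        return False
    finally:
        s.close()


def check_host(host: str, port: int = 445,
               timeout: float = 3.0) -> Dict[str, bool]:
    """Map each resolved IP of the host to whether a TCP connect worked."""
    return {sockaddr[0]: can_connect(family, sockaddr, timeout)
            for family, sockaddr in resolve_all(host, port)}


# Example (not run here, it needs your DNS and network):
#   print(check_host("wienroot1.wien.rbgat.net"))
```

If some of the addresses consistently fail while others work, that points at routing or a dead DC behind the name rather than at winbind itself. Running it in a loop during one of the bad periods would be more telling than a single run.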
Mar 13 15:05:48 xlsqip02 sockd[4235]: warning: accept(2) failed: Resource temporarily unavailable (errno = 11)
Mar 13 15:06:20 xlsqip02 last message repeated 7 times
Mar 13 15:07:26 xlsqip02 last message repeated 4 times
Mar 13 15:08:27 xlsqip02 last message repeated 4 times
Mar 13 15:09:30 xlsqip02 last message repeated 10 times
Mar 13 15:10:37 xlsqip02 last message repeated 7 times
Mar 13 15:11:39 xlsqip02 last message repeated 11 times
Mar 13 15:12:55 xlsqip02 last message repeated 9 times
Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0] libsmb/credentials.c:creds_client_check(324)
Mar 13 15:12:57 xlsqip02 winbindd[4331]: creds_client_check: credentials check failed.
Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0] rpc_client/cli_netlogon.c:rpccli_netlogon_sam_network_logon(1030)
Mar 13 15:12:57 xlsqip02 winbindd[4331]: rpccli_netlogon_sam_network_logon: credentials chain check failed
Mar 13 15:13:05 xlsqip02 sockd[4235]: warning: accept(2) failed: Resource temporarily unavailable (errno = 11)
btw, winbind just sucks ... But I doubt that winbind is the root
cause ...
Right. Something underneath it is, and it is affecting both winbind
and squid_ldap_auth connectivity. Possibly routing related.
Anyway, we had some NIC issues before (packet drops), so at the moment
we have disabled all the TSO stuff:
root@xlsqip02 ~ # ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off
root@xlsqip02 ~ # ethtool -i eth0
driver: bnx2
version: 1.9.3
firmware-version: 4.6.4 NCSI 1.0.3
bus-info: 0000:02:00.0
root@xlsqip02 ~ # ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX: 1020
RX Mini: 0
RX Jumbo: 4080
TX: 255
Current hardware settings:
RX: 1020
RX Mini: 0
RX Jumbo: 0
TX: 255
netstat output, in case it is of interest:
root@xlsqip02 ~ # netstat -s
Ip:
1031106057 total packets received
32 with invalid addresses
0 forwarded
0 incoming packets discarded
1031105815 incoming packets delivered
943692708 requests sent out
214 dropped because of missing route
Possibly related.
34 reassemblies required
17 packets reassembled ok
Icmp:
77877 ICMP messages received
339 input ICMP message failed.
ICMP input histogram:
destination unreachable: 31124
unreachable is way too high. The NIC is either going down
intermittently or a route has disappeared for some destinations.
timeout in transit: 3011
echo requests: 43271
echo replies: 467
43804 ICMP messages sent
0 ICMP messages failed
ICMP output histogram:
destination unreachable: 66
echo request: 467
echo replies: 43271
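To put a number on "way too high": the share of destination unreachable messages in the ICMP input histogram can be pulled out of the `netstat -s` text as sketched below. The function name and the parsing approach are mine, written against the output format shown above; other netstat versions may lay the section out differently.

```python
import re
from typing import Dict


def icmp_input_histogram(netstat_s: str) -> Dict[str, int]:
    """Extract the 'name: count' lines that follow 'ICMP input histogram:'."""
    counts: Dict[str, int] = {}
    in_section = False
    for line in netstat_s.splitlines():
        stripped = line.strip()
        if stripped == "ICMP input histogram:":
            in_section = True
            continue
        if in_section:
            m = re.fullmatch(r"([a-z ]+): (\d+)", stripped)
            if not m:
                break  # first line that is not 'name: count' ends the section
            counts[m.group(1)] = int(m.group(2))
    return counts


# Values copied from the netstat -s output quoted above.
SAMPLE = """\
Icmp:
    77877 ICMP messages received
    339 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 31124
        timeout in transit: 3011
        echo requests: 43271
        echo replies: 467
    43804 ICMP messages sent
"""

if __name__ == "__main__":
    hist = icmp_input_histogram(SAMPLE)
    share = hist["destination unreachable"] / sum(hist.values())
    print(f"destination unreachable: {share:.1%} of ICMP input")
```

With your figures that comes out at roughly 40% of all ICMP received. Since these counters are cumulative since boot, comparing two snapshots taken a few minutes apart during an outage would show whether the unreachables arrive in bursts.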
Amos