Re: Squid Ldap Authenticators

guest01 <guest01@xxxxxxxxx> · Tue, 13 Mar 2012 15:54:50 +0100

Hi,

Sorry, I pressed the send button by mistake ...

We are having strange Squid troubles. First, let me describe our setup:

- 4 HP DL380 G6/G7 servers, each with 16 CPUs and 28 GB RAM, running
RHEL 5.4-5.8 (64-bit) and Squid 3.1.12 (custom compiled)
Squid Cache: Version 3.1.12
configure options:  '--enable-ssl' '--enable-icap-client'
'--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp'
'--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs'
'--enable-removal-policies=heap,lru' '--enable-epoll'
'--disable-ident-lookups' '--enable-truncate'
'--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid'
'--with-default-user=squid' '--prefix=/opt/squid' '--enable-auth=basic
digest ntlm negotiate'
'-enable-negotiate-auth-helpers=squid_kerb_auth'
--with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience

- Each server has two instances for kerberos/ntlm authentication and
two instances for LDAP authentication (different customers)
- We have a hardware load balancer which balances requests for our
Kerberos customers (4x2 instances) and LDAP customers (4x2 instances);
each has a different IP address.
- average load values are approx 0.5 (5min values)
- approx 60RPS per instance (equally distributed -> 16 * 60 => 960 RPS)
- up to 150Mbit/s traffic per server
- ICAP servers for content adaptation (multiple servers with a hardware
load balancer in front of them)

From time to time we have trouble with our Squid servers. It may not
be related to Squid itself; I suspect an OS issue. Nevertheless,
sometimes the servers don't respond to requests (even SSH requests),
or logging in takes forever (reverse lookup failure?), or, even worse,
the server interface is simply down (with no indication of any problem
at the switch port level). If we check the squidclient output, we can
see some hanging LDAP authenticators:

squid@xlsqit01 /opt/squid/bin $ ./squidclient -h 10.122.125.23
cache_object://10.122.125.23/basicauthenticator
HTTP/1.0 200 OK
Server: squid/3.1.12
Mime-Version: 1.0
Date: Tue, 13 Mar 2012 13:34:07 GMT
Content-Type: text/plain
Expires: Tue, 13 Mar 2012 13:34:07 GMT
Last-Modified: Tue, 13 Mar 2012 13:34:07 GMT
X-Cache: MISS from xlsqip02_3
Via: 1.0 xlsqip02_3 (squid/3.1.12)
Connection: close

Basic Authenticator Statistics:
program: /opt/squid/libexec/squid_ldap_auth
number active: 20 of 20 (0 shutting down)
requests sent: 13316
replies received: 13312
queue length: 0
avg service time: 4741 msec

      #      FD     PID  # Requests     Flags      Time  Offset Request
      1      12   16038        2150     B       125.885       0 user1 pw1\n
      2      24   16043          85     B       119.562       0 user2 pw2\n
      3      32   16049          63     B        13.639       0 user3 pw3\n
      4      43   16055          21     B       116.143       0 user4 pw4\n
      5      46   16059          12             189.002       0 (none)
      6      50   16064           1             189.003       0 (none)
      7      56   16069           2               0.079       0 (none)
      8      60   16074           0               0.000       0 (none)
      9      65   16079           0               0.000       0 (none)
     10      86   16084           0               0.000       0 (none)
     11      88   16095           0               0.000       0 (none)
     12      90   16101           0               0.000       0 (none)
     13      92   16117           0               0.000       0 (none)
     14      95   16122           0               0.000       0 (none)
     15      97   16130           0               0.000       0 (none)
     16      99   16138           0               0.000       0 (none)
     17     101   16144           0               0.000       0 (none)
     18     104   16150           0               0.000       0 (none)
     19     107   16162           0               0.000       0 (none)
     20     109   16173           0               0.000       0 (none)

Flags key:

   B = BUSY
   W = WRITING
   C = CLOSING
   S = SHUTDOWN PENDING

2012/03/13 03:00:04| Ready to serve requests.
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
squid_ldap_auth: WARNING, could not bind to binddn 'Can't contact LDAP server'
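
To spot hanging helpers without reading the table by eye, the helper rows in the squidclient report above can be filtered with a small script. This is only a minimal sketch: the function name `stuck_helpers` and the 60-second threshold are made up for illustration, and it assumes the Time column is in seconds and, for a BUSY helper, reflects how long the current request has been outstanding.

```python
import re

# One helper row: index, FD, PID, request count, optional flags, time, offset.
# The trailing request column is ignored on purpose (it contains credentials).
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([BWCS]*)\s+([\d.]+)\s+(\d+)\s"
)

def stuck_helpers(report, threshold_s=60.0):
    """Return (index, pid, seconds) for BUSY helpers above the threshold.

    threshold_s is an arbitrary illustrative value, not a Squid setting.
    """
    stuck = []
    for line in report.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header, flags key, or any other non-row line
        index, pid = int(m.group(1)), int(m.group(3))
        flags, seconds = m.group(5), float(m.group(6))
        if "B" in flags and seconds > threshold_s:
            stuck.append((index, pid, seconds))
    return stuck

# A few rows copied from the report above.
SAMPLE = r"""
      #      FD     PID  # Requests     Flags      Time  Offset Request
      1      12   16038        2150     B       125.885       0 user1 pw1\n
      2      24   16043          85     B       119.562       0 user2 pw2\n
      3      32   16049          63     B        13.639       0 user3 pw3\n
      4      43   16055          21     B       116.143       0 user4 pw4\n
      5      46   16059          12             189.002       0 (none)
      6      50   16064           1             189.003       0 (none)
"""

for index, pid, seconds in stuck_helpers(SAMPLE):
    print(f"helper {index} (pid {pid}) busy for {seconds:.1f}s")
```

Helpers 5 and 6 show a large Time value as well, but they are not flagged BUSY, so the sketch skips them.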

Testing the LDAP authentication from the command line works without
any problems:

root@xlsqip02 ~ #  /opt/squid/libexec/squid_ldap_auth -b
"dc=squid-proxy" -D "uid=...." -w xxx -h ldaphost -f "(uid=%s)"
user1 pw1
OK

Unfortunately, there is nothing helpful in syslog, e.g.:
Mar 13 15:05:19 xlsqip02 last message repeated 2 times
Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0]
libsmb/clientgen.c:cli_receive_smb(111)
Mar 13 15:05:25 xlsqip02 winbindd[4283]:   Receiving SMB: Server
stopped responding
Mar 13 15:05:25 xlsqip02 winbindd[4283]: [2012/03/13 15:05:25, 0]
rpc_client/cli_pipe.c:rpc_api_pipe(790)
Mar 13 15:05:25 xlsqip02 winbindd[4283]:   rpc_api_pipe: Remote
machine wienroot1.wien.rbgat.net pipe \lsarpc fnum 0x4008returned
critical error. Error was Call timed out: server did not respond after
10000 milliseconds
Mar 13 15:05:48 xlsqip02 sockd[4235]: warning: accept(2) failed:
Resource temporarily unavailable (errno = 11)
Mar 13 15:06:20 xlsqip02 last message repeated 7 times
Mar 13 15:07:26 xlsqip02 last message repeated 4 times
Mar 13 15:08:27 xlsqip02 last message repeated 4 times
Mar 13 15:09:30 xlsqip02 last message repeated 10 times
Mar 13 15:10:37 xlsqip02 last message repeated 7 times
Mar 13 15:11:39 xlsqip02 last message repeated 11 times
Mar 13 15:12:55 xlsqip02 last message repeated 9 times
Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0]
libsmb/credentials.c:creds_client_check(324)
Mar 13 15:12:57 xlsqip02 winbindd[4331]:   creds_client_check:
credentials check failed.
Mar 13 15:12:57 xlsqip02 winbindd[4331]: [2012/03/13 15:12:57, 0]
rpc_client/cli_netlogon.c:rpccli_netlogon_sam_network_logon(1030)
Mar 13 15:12:57 xlsqip02 winbindd[4331]:
rpccli_netlogon_sam_network_logon: credentials chain check failed
Mar 13 15:13:05 xlsqip02 sockd[4235]: warning: accept(2) failed:
Resource temporarily unavailable (errno = 11)

btw, winbind just sucks ... But I doubt that winbind is the root cause ...

Anyway, we had some NIC issues before (packet drops); at the moment we
have disabled TSO and all other offload settings:

root@xlsqip02 ~ # ethtool -k eth0
Offload parameters for eth0:
Cannot get device udp large send offload settings: Operation not supported
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
generic-receive-offload: off

root@xlsqip02 ~ # ethtool -i eth0
driver: bnx2
version: 1.9.3
firmware-version: 4.6.4 NCSI 1.0.3
bus-info: 0000:02:00.0

root@xlsqip02 ~ # ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             1020
RX Mini:        0
RX Jumbo:       4080
TX:             255
Current hardware settings:
RX:             1020
RX Mini:        0
RX Jumbo:       0
TX:             255

netstat output, in case it is of interest:
root@xlsqip02 ~ # netstat -s
Ip:
    1031106057 total packets received
    32 with invalid addresses
    0 forwarded
    0 incoming packets discarded
    1031105815 incoming packets delivered
    943692708 requests sent out
    214 dropped because of missing route
    34 reassemblies required
    17 packets reassembled ok
Icmp:
    77877 ICMP messages received
    339 input ICMP message failed.
    ICMP input histogram:
        destination unreachable: 31124
        timeout in transit: 3011
        echo requests: 43271
        echo replies: 467
    43804 ICMP messages sent
    0 ICMP messages failed
    ICMP output histogram:
        destination unreachable: 66
        echo request: 467
        echo replies: 43271
IcmpMsg:
        InType0: 467
        InType3: 31124
        InType8: 43271
        InType11: 3011
        OutType0: 43271
        OutType3: 66
        OutType8: 467
Tcp:
    12276761 active connections openings
    15459221 passive connection openings
    10856 failed connection attempts
    875610 connection resets received
    899 connections established
    1027715456 segments received
    937603201 segments send out
    3241669 segments retransmited
    4510 bad segments received.
    1712333 resets sent
Udp:
    2017646 packets received
    13 packets to unknown port received.
    0 packet receive errors
    2804237 packets sent
TcpExt:
    1717 resets received for embryonic SYN_RECV sockets
    4122 packets pruned from receive queue because of socket buffer overrun
    2794 ICMP packets dropped because they were out-of-window
    10477659 TCP sockets finished time wait in fast timer
    105 time wait sockets recycled by time stamp
    15872733 delayed acks sent
    50531 delayed acks further delayed because of locked socket
    Quick ack mode was activated 5087770 times
    742212 packets directly queued to recvmsg prequeue.
    4525346 packets directly received from backlog
    88460270 packets directly received from prequeue
    444903256 packets header predicted
    286109 packets header predicted and directly queued to user
    139821827 acknowledgments not containing data received
    291299405 predicted acknowledgments
    277775 times recovered from packet loss due to fast retransmit
    1646 times recovered from packet loss due to SACK data
    5 bad SACKs received
    Detected reordering 82 times using FACK
    Detected reordering 62 times using SACK
    Detected reordering 1489 times using reno fast retransmit
    Detected reordering 12950 times using time stamp
    13026 congestion windows fully recovered
    83165 congestion windows partially recovered using Hoe heuristic
    TCPDSACKUndo: 5989
    18959 congestion windows recovered after partial ack
    3362 TCP data loss events
    TCPLostRetransmit: 1
    57745 timeouts after reno fast retransmit
    1198 timeouts after SACK recovery
    31172 timeouts in loss state
    473117 fast retransmits
    1478 forward retransmits
    1445453 retransmits in slow start
    888214 other TCP timeouts
    TCPRenoRecoveryFail: 134832
    27 sack retransmits failed
    589052 packets collapsed in receive queue due to low socket buffer
    4622235 DSACKs sent for old packets
    9737 DSACKs sent for out of order packets
    15792 DSACKs received
    1 DSACKs for out of order packets received
    421669 connections reset due to unexpected data
    2977 connections reset due to early user close
    11177 connections aborted due to timeout
IpExt:
    InBcastPkts: 1294824
    OutBcastPkts: 648179

root@xlsqip02 ~ # uptime
 15:43:02 up 6 days, 54 min,  2 users,  load average: 0.12, 0.14, 0.16

Usually, a high response time from the Squid authenticators means
there is an issue, e.g.:
http://desmond.imageshack.us/Himg840/scaled.php?server=840&filename=squid3ldapauthenticator.png&res=medium

sysctl-values: http://pastie.org/3586014

Sometimes restarting the network or rebooting the server helps, but
during the last couple of days these issues have occurred far too often.

Does anybody have an idea what else we could check, or has anyone
experienced anything similar? Thanks!

regards
Peter

On Tue, Mar 13, 2012 at 3:07 PM, guest01 <guest01@xxxxxxxxx> wrote:
> Hi,
>
> We are having strange Sq