Hi guys,

It seems we solved the problem. Squid was running out of file
descriptors; up until recently the limit was at 1024. There were also
many log entries:

2013/06/13 14:15:11| client_side.cc(3032) okToAccept: WARNING! Your cache is running out of filedescriptors

Checking with squidclient:

File descriptor usage for squid:
  Maximum number of file descriptors:   4096
  Largest file desc currently in use:   1644
  Number of file desc currently in use: 1263
  Files queued for open:                   0
  Available number of file descriptors: 2833
  Reserved number of file descriptors:   100
  Store Disk files open:                   0

My guess is that there must have been many other connections which did
not work because of the lack of file descriptors. Anyway, it seems to
work now.

Thank you all very much!

regards,
Peter

On Thu, Jun 13, 2013 at 10:30 PM, Eliezer Croitoru <eliezer@xxxxxxxxxxxx> wrote:
> Hey,
>
> Since you are using it only for filtering, it seems to me that you are
> not really using the machine's CPUs at all with only 4 instances.
> You can use more of the CPU on the same machine with SMP support (and
> without it).
> I won't tell you to "try" and "experiment" on your clients, since that
> would be rude, but since it's only filtering and there is no cache
> involved, you could easily and smoothly run just another instance of
> Squid 3.3.5 alongside to test for the problem.
> My advice would be to try 3.3, just to let these monsters make good
> use of their CPUs.
> If I'm not wrong, each machine can handle more CPU and more
> connections than it is handling right now.
> Again, you will need to think about it and plan the migration;
> SMP sometimes does not work just out of the box, specifically in a
> scenario like yours with very loaded servers.
>
> Another issue is the network interface, which can slow things down.
> If you can dedicate one interface to the ICAP service connections
> only, I would go for it.
> Also, if you can use more than one interface, as in bonding/teaming or
> Fibre Channel, I believe some network issues will simply not apply in
> your case.
>
> If you can probe the ICAP service with a simple script, it can give
> you a better indication of whether the fault is a Squid 3.1 problem or
> the ICAP service being too loaded.
>
> You can use tcpdump to capture one of the many ICAP REQMOD requests
> and write a small Nagios-like script that will say "ok" or "err" and
> report it to MRTG or in any other way.
> This way you can pinpoint the problem and aim in the right direction
> (Squid or the ICAP service).
>
> This can also be used if you use Nagios:
> http://exchange.nagios.org/directory/Plugins/Anti-2DVirus/check_icap-2Epl/details
>
> What monitoring system are you using? Nagios? Zabbix? Munin? Icinga? PRTG?
>
> Thanks,
> Eliezer
>
>
> On 6/13/2013 5:22 PM, guest01 wrote:
>>
>> Hi,
>>
>> Thanks for your answers.
>>
>> At the moment, we have 4 "monster" servers, with no indication of any
>> performance issues. (There is extensive Munin monitoring.)
>>
>> TCP states: http://prntscr.com/19qle2
>> CPU: http://prntscr.com/19qltm
>> Load: http://prntscr.com/19qlwe
>> Vmstat: http://prntscr.com/19qm3v
>> Bandwidth: http://prntscr.com/19qmc4
>>
>> We have 4 Squid instances per server and 4 servers, handling
>> altogether approx 2000 rps without hard-disk caching. Half of them
>> are doing Kerberos authentication and the other half LDAP
>> authentication. Content scanning is done by a couple (6 at the
>> moment) of Webwasher appliances.
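With that many concurrent client, server and ICAP connections per
instance, the 1024-descriptor default mentioned at the top of this
thread runs out quickly. Checking and raising it looks roughly like
this - a sketch only; exact paths depend on the distribution, and the
squid.conf directive is only in newer Squid:

  # see current usage (this is where the numbers quoted above come from)
  squidclient mgr:info | grep -i "file desc"

  # raise the per-process limit before starting Squid, e.g. in the init script
  ulimit -n 32768

  # Squid 3.2 and later also have a squid.conf directive for this
  # (not available in 3.1, where you rely on ulimit and --with-maxfd):
  # max_filedescriptors 32768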
>> These are my cache settings per instance:
>>
>> # cache specific settings
>> cache_replacement_policy heap LFUDA
>> cache_mem 1600 MB
>> memory_replacement_policy heap LFUDA
>> maximum_object_size_in_memory 2048 KB
>> memory_pools off
>> cache_swap_low 85
>> cache_swap_high 90
>>
>> My plan is to adjust a couple of ICAP timers and increase ICAP
>> debugging (to 93,4 or 93,5). I found these messages:
>>
>> 2013/06/13 03:49:42| essential ICAP service is down after an options
>> fetch failure: icap://10.122.125.48:1344/wwreqmod [down,!opt]
>> 2013/06/13 11:09:33.530| essential ICAP service is suspended:
>> icap://10.122.125.48:1344/wwreqmod [down,susp,fail11]
>>
>> What do down,!opt and down,susp,fail11 mean?
>>
>> thanks!
>> Peter
>>
>>
>>
>> On Thu, Jun 13, 2013 at 2:41 AM, Eliezer Croitoru <eliezer@xxxxxxxxxxxx>
>> wrote:
>>>
>>> Hey,
>>>
>>> There was a bug that is related to load on a server.
>>> Your server is a monster!!
>>> Squid 3.1.12 cannot even use the amount of CPU you have on this
>>> machine, as far as I can tell, unless you have a couple of clever
>>> ideas up your sleeve (routing, marking, etc.).
>>>
>>> To make sure what the problem is, I would also recommend verifying
>>> the load on the server in terms of open and half-open
>>> sessions/connections to Squid and to the ICAP service/server.
>>> Are you using this Squid server for filtering only, or also as a
>>> cache? If so, what is the cache size?
>>>
>>> The above questions can help us determine your situation and try to
>>> verify that the culprit is a specific bug which, from my testing on
>>> 3.3.5, doesn't exist anymore.
>>> If you are up for the task of verifying the loads on the server, I
>>> can tell you it's a 90% bet on the bug.
>>> What I had was a problem where, once Squid went over 900 RPS, the
>>> ICAP service would go into a mode in which it stopped responding to
>>> requests (and showed the mentioned screen).
>>> This bug was tested on a very slow machine compared to yours.
>>> On a monster like yours, the effect I tested might not appear with
>>> the same side effect of "denial of service" but rather as an
>>> "interruption of service" which your monster recovers from very
>>> quickly.
>>>
>>> I'm here if you need any assistance,
>>> Eliezer
>>>
>>>
>>> On 6/12/2013 4:57 PM, guest01 wrote:
>>>>
>>>> Hi guys,
>>>>
>>>> We are currently using Squid 3.1.12 (old, I know) on RHEL 5.8 64bit
>>>> (HP ProLiant DL380 G7 with 16 CPUs and 28GB RAM).
>>>> Squid Cache: Version 3.1.12
>>>> configure options: '--enable-ssl' '--enable-icap-client'
>>>> '--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp'
>>>> '--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs'
>>>> '--enable-removal-policies=heap,lru' '--enable-epoll'
>>>> '--disable-ident-lookups' '--enable-truncate'
>>>> '--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid'
>>>> '--with-default-user=squid' '--prefix=/opt/squid' '--enable-auth=basic
>>>> digest ntlm negotiate'
>>>> '-enable-negotiate-auth-helpers=squid_kerb_auth'
>>>> --with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience
>>>>
>>>> As ICAP server, we are using McAfee Webwasher 6.9 (old too, I know).
>>>> Up until recently we hardly had any problems with this environment.
>>>> Squid is doing authentication via Kerberos and passing the username
>>>> to the Webwasher, which is doing an LDAP lookup to find the user's
>>>> groups and assign a policy based on group membership.
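The username handoff described here is just an ICAP request header:
with the icap_client_username_header / icap_client_username_encode
settings shown further down, each REQMOD request Squid sends to the
Webwasher looks roughly like this - an illustrative sketch per RFC
3507 with invented addresses and a placeholder value, not a captured
request:

  REQMOD icap://10.122.125.48:1344/wwreqmod ICAP/1.0
  Host: 10.122.125.48:1344
  X-Client-IP: 10.0.0.17
  X-Authenticated-User: <base64 of the authenticated user, e.g. user@REALM>
  Encapsulated: req-hdr=0, null-body=310

followed by the encapsulated HTTP request headers; the Webwasher
decodes X-Authenticated-User and does its LDAP group lookup from there.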
>>>> We have multiple Squids and multiple Webwashers behind a hardware
>>>> load balancer, approx 15k users.
>>>>
>>>> For a couple of weeks now, we have been getting an ICAP server error
>>>> message almost daily, similar to:
>>>> http://support.kaspersky.com/2723
>>>> Unfortunately, I cannot figure out why. I blame the Webwasher, but I
>>>> am not 100% sure.
>>>>
>>>> This is my ICAP configuration:
>>>>
>>>> #ICAP
>>>> icap_enable on
>>>> icap_send_client_ip on
>>>> icap_send_client_username on
>>>> icap_preview_enable on
>>>> icap_preview_size 30
>>>> icap_uses_indirect_client off
>>>> icap_persistent_connections on
>>>> icap_client_username_encode on
>>>> icap_client_username_header X-Authenticated-User
>>>> icap_service service_req reqmod_precache bypass=0
>>>> icap://10.122.125.48:1344/wwreqmod
>>>> adaptation_access service_req deny favicon
>>>> adaptation_access service_req deny to_localhost
>>>> adaptation_access service_req deny from_localnet
>>>> adaptation_access service_req deny whitelist
>>>> adaptation_access service_req deny dst_whitelist
>>>> adaptation_access service_req deny icap_bypass_src
>>>> adaptation_access service_req deny icap_bypass_dst
>>>> adaptation_access service_req allow all
>>>> icap_service service_resp respmod_precache bypass=0
>>>> icap://10.122.125.48:1344/wwrespmod
>>>> adaptation_access service_resp deny favicon
>>>> adaptation_access service_resp deny to_localhost
>>>> adaptation_access service_resp deny from_localnet
>>>> adaptation_access service_resp deny whitelist
>>>> adaptation_access service_resp deny dst_whitelist
>>>> adaptation_access service_resp deny icap_bypass_src
>>>> adaptation_access service_resp deny icap_bypass_dst
>>>> adaptation_access service_resp allow all
>>>>
>>>> Could an upgrade (either to 3.2 or to 3.3) solve this problem? (There
>>>> are more ICAP options available in recent Squid versions.)
>>>> Unfortunately, this is a rather complex organisational process, which
>>>> is why I have not done it yet.
>>>> I do have a test machine, but this ICAP error is not reproducible
>>>> there, only in production. Server load and IO throughput are OK;
>>>> there is nothing suspicious on the server. I recently activated ICAP
>>>> debug option 93 and found the following messages:
>>>>
>>>> 2013/06/12 15:32:15| suspending ICAP service for too many failures
>>>> 2013/06/12 15:32:15| essential ICAP service is suspended:
>>>> icap://10.122.125.48:1344/wwrespmod [down,susp,fail11]
>>>> 2013/06/12 15:35:15| essential ICAP service is up:
>>>> icap://10.122.125.48:1344/wwreqmod [up]
>>>> 2013/06/12 15:35:15| essential ICAP service is up:
>>>> icap://10.122.125.48:1344/wwrespmod [up]
>>>>
>>>> I don't know why this check failed, but it usually does not occur
>>>> when clients are getting the ICAP protocol error page.
>>>>
>>>> Another possibility would be the ICAP bypass, but our ICAP server is
>>>> doing anti-malware checking, and that's why I don't want to activate
>>>> this feature.
>>>>
>>>> Does anybody have other ideas?
>>>>
>>>> Thanks!
>>>> Peter
>>>>
>>>
>
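For reference, the kind of "Nagios-like" probe Eliezer suggests above
can be as small as the sketch below: it sends an ICAP OPTIONS request
and prints "ok" or "err" with a non-zero exit code, which is enough to
hang off Nagios or MRTG. The host, port, service path and timeout are
taken from this thread's configuration and are assumptions to adjust;
a production check should parse the reply more carefully (ISTag,
Max-Connections, and so on).

#!/usr/bin/env python
# icap_probe.py - minimal ICAP OPTIONS probe (a sketch, not production code).
# Prints "ok" and exits 0 if the service answers "ICAP/1.0 200",
# otherwise prints "err ..." and exits 2 (Nagios CRITICAL).
import socket
import sys

HOST, PORT, SERVICE = "10.122.125.48", 1344, "/wwreqmod"  # adjust to taste
TIMEOUT = 5  # seconds

request = ("OPTIONS icap://%s:%d%s ICAP/1.0\r\n"
           "Host: %s:%d\r\n"
           "User-Agent: icap-probe\r\n"
           "Encapsulated: null-body=0\r\n"
           "\r\n" % (HOST, PORT, SERVICE, HOST, PORT))

try:
    # open the connection, send the OPTIONS request and read one reply chunk
    sock = socket.create_connection((HOST, PORT), TIMEOUT)
    sock.settimeout(TIMEOUT)
    sock.sendall(request.encode("ascii"))
    reply = sock.recv(4096).decode("ascii", "replace")
    sock.close()
except (socket.error, socket.timeout) as err:
    print("err %s" % err)
    sys.exit(2)

status = reply.splitlines()[0] if reply else "<empty reply>"
if status.startswith("ICAP/1.0 200"):
    print("ok")
    sys.exit(0)
print("err %s" % status)
sys.exit(2)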