Re: ICAP protocol error

Hey,

Since you are using it only for filtering, it seems to me that with only 4 instances you are barely using the machine's CPUs.
You could use more CPUs on the same machine, with SMP support (or even without it).
I won't tell you to "experiment" on your clients, since that would be rude, but as it's only filtering and there is no cache involved, you could easily and smoothly run just another instance of Squid 3.3.5 to test the problem. My advice would be to try 3.3, just to let these monsters make good use of their CPUs. If I'm not wrong, each machine can handle more CPU load and more connections than it is handling right now.
Again, you will need to think about it and plan the migration:
SMP sometimes does not work just out of the box, especially in your scenario of very loaded servers.

Another issue is the network interface, which can slow things down.
If you can dedicate one interface to the ICAP service connections, I would go for it. And if you can use more than one interface, e.g. bonding/teaming or fibre channel, I believe some of the possible network issues will simply stop applying to your case.

If you can probe the ICAP service with a simple script, it will give you a better indication of whether the fault is a Squid 3.1 problem or an overloaded ICAP service.

You can use tcpdump to capture one of the many ICAP REQMOD requests and then write a small Nagios-like script that reports "ok" or "err", graphing the results in MRTG or any other tool. This way you can pinpoint the problem and aim in the right direction (Squid or the ICAP service).
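Such a probe could be sketched roughly like this in Python. This is only a sketch: it sends an ICAP OPTIONS request and reports nagios-style; the host, port, and service name are taken from the log messages in this thread, so adjust them for your setup.

```python
#!/usr/bin/env python3
"""Rough sketch of a nagios-like ICAP probe: send an OPTIONS request
and report "ok"/"err". The host, port, and service name defaults are
taken from the log messages in this thread; adjust for your setup."""
import socket

def probe_icap(host="10.122.125.48", port=1344, service="wwreqmod", timeout=5.0):
    """Send an ICAP OPTIONS request; return the status code, or None on failure."""
    request = (
        f"OPTIONS icap://{host}:{port}/{service} ICAP/1.0\r\n"
        f"Host: {host}:{port}\r\n"
        "Encapsulated: null-body=0\r\n\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            reply = sock.recv(4096).decode("ascii", errors="replace")
    except OSError:
        return None
    return parse_icap_status(reply)

def parse_icap_status(reply):
    """Extract the numeric code from a status line like 'ICAP/1.0 200 OK'."""
    parts = reply.split("\r\n", 1)[0].split()
    if len(parts) >= 2 and parts[0].startswith("ICAP/") and parts[1].isdigit():
        return int(parts[1])
    return None

def nagios_report(status):
    """Map the probe result to a nagios-style (message, exit code) pair."""
    return ("ok", 0) if status == 200 else ("err", 2)
```

Running something like `print(nagios_report(probe_icap())[0])` from cron and feeding the output into MRTG (or letting Nagios act on the exit code) would give exactly the kind of ok/err history described above.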

This plugin can also be used if you use Nagios:
http://exchange.nagios.org/directory/Plugins/Anti-Virus/check_icap.pl/details

What monitoring system are you using? Nagios? Zabbix? Munin? Icinga? PRTG?

Thanks,
Eliezer

On 6/13/2013 5:22 PM, guest01 wrote:
Hi,

Thanks for your answers.

At the moment, we have 4 "monster" servers with no indication of any
performance issues (we have extensive Munin monitoring).

TCP-states: http://prntscr.com/19qle2
CPU: http://prntscr.com/19qltm
Load: http://prntscr.com/19qlwe
Vmstat: http://prntscr.com/19qm3v
Bandwidth: http://prntscr.com/19qmc4

We have 4 squid instances per server and 4 servers, handling
altogether approx. 2000 rps without hard-disk caching. Half of them do
Kerberos authentication and the other half LDAP authentication.
Content scanning is done by a couple (6 at the moment) of Webwasher
appliances. These are my cache settings per instance:
# cache specific settings
cache_replacement_policy heap LFUDA
cache_mem 1600 MB
memory_replacement_policy heap LFUDA
maximum_object_size_in_memory 2048 KB
memory_pools off
cache_swap_low 85
cache_swap_high 90

My plan is to adjust a couple of ICAP timers and increase ICAP
debugging (to 93,4 or 93,5). I found these messages:
2013/06/13 03:49:42| essential ICAP service is down after an options
fetch failure: icap://10.122.125.48:1344/wwreqmod [down,!opt]
2013/06/13 11:09:33.530| essential ICAP service is suspended:
icap://10.122.125.48:1344/wwreqmod [down,susp,fail11]
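For reference, the debug level mentioned above is set via the `debug_options` directive in squid.conf; a minimal example (section 93 covers the ICAP client code, everything else stays at the default level 1):

```
debug_options ALL,1 93,4
```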

What do down,!opt and down,susp,fail11 mean?

thanks!
Peter



On Thu, Jun 13, 2013 at 2:41 AM, Eliezer Croitoru <eliezer@xxxxxxxxxxxx> wrote:
Hey,

There was a bug related to load on a server.
Your server is a monster!!
As far as I can tell, Squid 3.1.12 cannot even use the amount of CPU
you have on this machine, unless you have a couple of clever ideas up
your sleeve (routing, marking, etc.).

To pin down the problem, I would also recommend verifying the load on
the server in terms of open and half-open sessions/connections to
Squid and to the ICAP service/server.
Are you using this Squid server for filtering only, or also as a cache?
If so, what is the cache size?
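One way to collect those open/half-open counts: a small Python sketch that tallies TCP states toward the ICAP service by parsing `ss -tan` output (this assumes iproute2's `ss` is available; the peer address is the ICAP server seen in this thread). Growing SYN-SENT or SYN-RECV counts would point at half-open connections piling up.

```python
#!/usr/bin/env python3
"""Tally TCP connection states toward the ICAP service by parsing
`ss -tan` output. Assumes iproute2's ss is installed; the peer
address below is the ICAP server from this thread."""
import subprocess
from collections import Counter

ICAP_ADDR = "10.122.125.48:1344"  # peer address of the ICAP service

def count_states(ss_output, peer=ICAP_ADDR):
    """Count TCP states on lines whose peer address column matches."""
    states = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        # ss -tan columns: State Recv-Q Send-Q Local:Port Peer:Port
        if len(fields) >= 5 and fields[4] == peer:
            states[fields[0]] += 1
    return states

def live_counts():
    """Snapshot the current state counts from a live `ss -tan` run."""
    out = subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout
    return count_states(out)
```

Run repeatedly (e.g. from cron or a Munin plugin), this gives a per-state time series for connections to the ICAP service, alongside the graphs already being collected.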

The above questions can help us assess your situation and verify that
the culprit is a specific bug which, from my testing on 3.3.5, no
longer exists.
If you are up for the task of verifying the loads on the server, I can
tell you it's a 90% bet on the bug.
What I saw was that when Squid went over 900 RPS, the ICAP service
would go into a mode in which it stopped responding to requests (and
showed the mentioned screen).
This bug was tested on a very slow machine compared to yours.
On a monster like yours, the effect I tested might not appear with the
same symptoms: not a "denial of service" but rather an "interruption of
service" from which your monster recovers very quickly.

I'm here if you need any assistance,
Eliezer


On 6/12/2013 4:57 PM, guest01 wrote:

Hi guys,

We are currently using Squid 3.1.12 (old, I know) on RHEL 5.8 64-bit
(HP ProLiant DL380 G7 with 16 CPUs and 28GB RAM):
Squid Cache: Version 3.1.12
configure options:  '--enable-ssl' '--enable-icap-client'
'--sysconfdir=/etc/squid' '--enable-async-io' '--enable-snmp'
'--enable-poll' '--with-maxfd=32768' '--enable-storeio=aufs'
'--enable-removal-policies=heap,lru' '--enable-epoll'
'--disable-ident-lookups' '--enable-truncate'
'--with-logdir=/var/log/squid' '--with-pidfile=/var/run/squid.pid'
'--with-default-user=squid' '--prefix=/opt/squid' '--enable-auth=basic
digest ntlm negotiate'
'-enable-negotiate-auth-helpers=squid_kerb_auth'
--with-squid=/home/squid/squid-3.1.12 --enable-ltdl-convenience

As ICAP server, we are using McAfee Webwasher 6.9 (old too, I know).
Until recently we hardly had any problems with this environment.
Squid does authentication via Kerberos and passes the username to the
Webwasher, which does an LDAP lookup to find the user's groups and
assigns a policy based on group membership.
We have multiple Squids and multiple Webwashers behind a hardware
load balancer, approx. 15k users.

Since a couple of weeks ago, we have been getting an ICAP server error
message almost daily, similar to:
http://support.kaspersky.com/2723
Unfortunately, I cannot figure out why. I blame the Webwasher, but I
am not 100% sure.

This is my ICAP configuration:
#ICAP
icap_enable on
icap_send_client_ip on
icap_send_client_username on
icap_preview_enable on
icap_preview_size 30
icap_uses_indirect_client off
icap_persistent_connections on
icap_client_username_encode on
icap_client_username_header X-Authenticated-User
icap_service service_req reqmod_precache bypass=0 icap://10.122.125.48:1344/wwreqmod
adaptation_access service_req deny favicon
adaptation_access service_req deny to_localhost
adaptation_access service_req deny from_localnet
adaptation_access service_req deny whitelist
adaptation_access service_req deny dst_whitelist
adaptation_access service_req deny icap_bypass_src
adaptation_access service_req deny icap_bypass_dst
adaptation_access service_req allow all
icap_service service_resp respmod_precache bypass=0 icap://10.122.125.48:1344/wwrespmod
adaptation_access service_resp deny favicon
adaptation_access service_resp deny to_localhost
adaptation_access service_resp deny from_localnet
adaptation_access service_resp deny whitelist
adaptation_access service_resp deny dst_whitelist
adaptation_access service_resp deny icap_bypass_src
adaptation_access service_resp deny icap_bypass_dst
adaptation_access service_resp allow all

Could an upgrade (either to 3.2 or to 3.3) solve this problem? (There
are more ICAP options available in recent Squid versions.)
Unfortunately, an upgrade is a rather complex organisational process,
which is why I have not done it yet.
I do have a test machine, but this ICAP error is not reproducible
there, only in production. Server load and IO throughput are OK; there
is nothing suspicious on the server. I recently activated ICAP debug
section 93 and found the following messages:
2013/06/12 15:32:15| suspending ICAP service for too many failures
2013/06/12 15:32:15| essential ICAP service is suspended:
icap://10.122.125.48:1344/wwrespmod [down,susp,fail11]
2013/06/12 15:35:15| essential ICAP service is up:
icap://10.122.125.48:1344/wwreqmod [up]
2013/06/12 15:35:15| essential ICAP service is up:
icap://10.122.125.48:1344/wwrespmod [up]
I don't know why this check failed, but it usually does not coincide
with clients getting the ICAP protocol error page.

Another possibility would be the ICAP bypass, but our ICAP server does
anti-malware checking, which is why I don't want to activate that
feature.

Does anybody have other ideas?

Thanks!
Peter






