Re: squid centos and osq_lock

Marcus Kool <marcus.kool@xxxxxxxxxxxxxxx> · Sat, 01 Aug 2015 08:45:39 -0300

On 07/31/2015 03:56 PM, Amos Jeffries wrote:
On 1/08/2015 4:06 a.m., Josip Makarevic wrote:
Marcus, tnx for your info.
OS is centos 6 w kernel  2.6.32-504.30.3.el6.x86_64
Yes, cpu_affinity_map is good and with 6 instances there is load only on
first 6 cores and the server is 12 core, 24 HT

Then I suspect that mutex and locking will be the kernel scheduling work
on the HT cores.
  In high-performance use Squid will max out a physical core's worth of
cycles. HT essentially tries to over-clock physical cores. But trying to
reach 200% capacity into a physical core with Squid workloads only leads
to trouble.
  It is far better to tie Squid with affinity to one instance per
physical core and let the extra HT capacity be available to the OS and
other supporting things the Squid instance needs to have happen externally.

each instance is bound to 1 core. Instance 1 = core1, instance 2 = core 2
and so on so that should not be the problem.
I've tried with 12 workers but that's even worse.

You do need to be very careful about which core numbers are the HT core
vs the physical core ID. Last time I saw anyone doing it, every second
number was a real physical core ID. YMMV.

There are 2 mappings and I have seen them both but I do not recall which I saw where.
You can do the following to find out which CPU# is a sibling (HT core):
cd /sys/devices/system/cpu
for cpu in cpu[0-9]* ; do
   cat $cpu/topology/thread_siblings_list
done
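
To turn that listing into a usable set of CPU numbers, something like the following may help. It is only a sketch: the function name physical_cpus is made up for illustration, and it assumes that every sibling of a core reports the same thread_siblings_list, so keeping the first number of each list and de-duplicating leaves one CPU per physical core.

```shell
#!/bin/sh
# physical_cpus [CPU_DIR]   (hypothetical helper, not part of any package)
# Prints one Linux CPU number per physical core: the first CPU named in each
# distinct thread_siblings_list. CPU_DIR defaults to /sys/devices/system/cpu.
physical_cpus() {
    dir=${1:-/sys/devices/system/cpu}
    for f in "$dir"/cpu[0-9]*/topology/thread_siblings_list; do
        [ -r "$f" ] || continue
        # A list looks like "0,12" or "0-1"; keep only the first number.
        sed 's/[,-].*//' "$f"
    done | sort -n -u
}
```

Running physical_cpus with no argument reads the live sysfs tree. Note that Linux numbers CPUs from 0 while the Squid documentation for cpu_affinity_map numbers cores from 1, so check the offset before copying the values into cores=.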

Let me try to explain:
on non-smp with traffic at ~300mbits we have load of ~4 (on 6 workers).
in that case, actual user time is about 10-20% and 70-80% is sys time
(osq_lock) and there are no connection timeouts.

The CPU time in osq_lock is not easy to explain but it is not likely caused by Squid itself.
Googling about osq_lock led me to a kernel patch discussion where 500 dd processes on ext4/multipath or a file system repair with 125 threads caused the system to use 70+% CPU in osq_lock.
The general belief was that a lot of outstanding IO caused it.
This brings me to these questions:
- what is your testing method ?
- are there simply too many concurrent connections per instance of Squid ?
- are the bonded 10G interfaces supported by CentOS 6 ?
- can you test with unbonded ethernet? (the bonding driver code uses 2 locks)
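
Whatever the testing method is, it helps to record the user/sys split the same way for every run so that bonded and unbonded results can be compared. A rough sketch, assuming the aggregate "cpu" line of /proc/stat with the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_split is made up for illustration:

```shell
#!/bin/sh
# cpu_split BEFORE AFTER   (hypothetical helper)
# BEFORE and AFTER are two samples of the aggregate "cpu ..." line from
# /proc/stat. Prints the user and system share of the elapsed jiffies,
# truncated to whole percent.
cpu_split() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
        NR == 2 {
            total = 0
            for (i = 2; i <= 9; i++) total += $i - a[i]   # user .. steal
            if (total <= 0) { print "no elapsed jiffies"; exit 1 }
            printf "user=%d%% sys=%d%%\n", ($2 - a[2]) * 100 / total, ($4 - a[4]) * 100 / total
        }'
}
```

It could be used like this during a test run:

    before=$(grep '^cpu ' /proc/stat); sleep 10; after=$(grep '^cpu ' /proc/stat)
    cpu_split "$before" "$after"

The sys figure here counts only the "system" column; irq and softirq time are part of the total but not of the sys share.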

You may or may not get better results with CentOS 7 or a custom kernel (try the latest, or one older than 3.12, since some issues started with 3.12).

If I switch to SMP 6 workers user time goes up but sys time goes up too and
there are connection timeouts and the load jumps to ~12.
If I give it more workers only load jumps and more connections are being
dropped to the point that load goes to 23/24 and the entire server is slow
as hell.

So, best performance so far is with 6 non-smp workers.

'workers' is a term used by Squid SMP.
To avoid confusion, in a non-SMP Squid config, I suggest using the term 'instance'.

Marcus

For now I have 2 options:
1. Install older squid (3.1.10 centos repo) and try it then
2. build custom 64bit kernel with RCU and specific cpu family support (in
progress).

The end idea is to be able to sustain 1gig of traffic on this server :)
Any advice is welcome

I agree with Marcus then. The non-SMP then is the way to go at present.
The main benefit of SMP support in current Squid is for caching
de-duplication (ie rock store).

Also some things to note:

* a good percentage of the speed of Squid is the 20-40% caching HIT rate
normal HTTP traffic has. Albeit memory-only caching on highest
performance boxen. Memory hits are 4-6 orders of magnitude faster than
network fetches. This has little to do with anything you can control
(normally). The (relatively) slow speed of origin servers creating the
content is the bottleneck. Even "static" content may be encoded to the
client's requested preference on each fetch, which takes time.

* Going by our lab tests and real-world results so far I rate Squid
per-worker at ~50Mbps on a 3.1GHz core, and ~70Mbps on 3.7GHz. Your 12
cores will only get you up around 800 Mbits IMHO (that's after tuning). I
would gladly be proven wrong though :-)

* Squid effectively *polls* all the listening ports every 10ms or once
every 10 I/O events (whichever is faster). So running with 1024
listening ports is a bit counter-productive, more time could be spent
checking those ports than doing work.
  That said going from one to multiple listening ports does make a speed
improvement. Finding the sweet spot between those trends is something
else to tune for.
  <http://wiki.squid-cache.org/MultipleInstances#Tips>
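
The per-worker figures above can be turned into a quick back-of-envelope estimate. This is only a sketch: it assumes throughput scales linearly with the number of instances and that nothing else (NIC, kernel locking, DNS) becomes the bottleneck first, which the rest of this thread suggests is optimistic. The function names are made up for illustration.

```shell
#!/bin/sh
# aggregate_mbps INSTANCES PER_INSTANCE_MBPS
# Linear estimate of total throughput in Mbps.
aggregate_mbps() {
    echo $(( $1 * $2 ))
}

# instances_needed TARGET_MBPS PER_INSTANCE_MBPS
# Smallest number of instances whose linear estimate reaches the target
# (integer division rounded up).
instances_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

aggregate_mbps 12 50        # prints 600
aggregate_mbps 12 70        # prints 840
instances_needed 1000 70    # prints 15
```

So 12 instances land somewhere between 600 and 840 Mbps on these numbers, and a 1 Gbit target would need about 15 instances at the higher per-core rate.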

2015-07-31 14:53 GMT+02:00 Marcus Kool:

osq_lock is used in the kernel for the implementation of a mutex.
It is not clear which mutex so we can only guess.

Which version of the kernel and distro do you use?

Since mutexes are used by Squid SMP, I suggest switching to non-SMP Squid
for now.

What is the value of cpu_affinity_map in all config files?
You say they are static. But do you allocate each instance on a different
core?
Does 'top' show that all CPUs are used?

Do you have 24 cores or 12 hyperthreaded cores?
In case you have 12 real cores, you might want to experiment with 12
instances of Squid and then try to upscale.

Make maximum_object_size large, a max size of 16K will prohibit the
retrieval of objects larger than 16K.
I am not sure about 'maximum_object_size_in_memory 16 KB' but let it be
infinite and do not worry since
cache_mem is zero.

Marcus

On 07/31/2015 03:52 AM, Josip Makarevic wrote:

Hi Amos,

   cache_mem 0
   cache deny all

already there.
Regarding the number of NIC ports, we have 4 10G eth cards, 2 in each bonding
interface.

Well, entire config would be way too long but here is the static part:
via off
cpu_affinity_map process_numbers=1 cores=2
forwarded_for delete
visible_hostname squid1
pid_filename /var/run/squid1.pid

Remove these...

icp_port 0
htcp_port 0
icp_access deny all
htcp_access deny all
snmp_port 0
snmp_access deny all

... to here. They do nothing but slow Squid-3 down.

dns_nameservers x.x.x.x
cache_mem 0
cache deny all
pipeline_prefetch on

In Squid-3.4 and later this is set to the length of pipeline you want to
accept.

NP: 'on' traditionally has meant pipeline length of 1 (two parallel
requests). Longer lengths are not yet well tested but generally it seems
to work okay.

memory_pools off
maximum_object_size 16 KB
maximum_object_size_in_memory 16 KB

Like Marcus said. Without even memory caching these two have no useful
effects.

There is one related setting "read_ahead_gap" which affects performance
by tuning the amount of undelivered object data Squid will buffer in
transient memory. A higher value means faster servers can finish
sending earlier and have their resources released for other uses.
  Tuning this is a fine art since it modulates how much Squid internal
buffers (and pipeline prefetching) read off TCP buffers. And all of
those buffers have limits of their own and may contain multiple requests'
data.

ipcache_size 0

Remove this. Without an IP cache Squid will be forced to do about 4x remote
DNS lookups for every single HTTP request - *minimum*. Maybe more if you
apply any access controls to the traffic.
  If anything increase the ipcache size to store more results.

cache_store_log none

Not needed in Squid-3. You can remove.

half_closed_clients off
include /etc/squid/rules
access_log /var/log/squid/squid1-access.log

Logging I/O slows Squid down. I suggest making that a daemon, TCP or UDP
log output.

cache_log /var/log/squid/squid1-cache.log
coredump_dir /var/spool/squid/squid1
refresh_pattern ^ftp:           1440    20%     10080
refresh_pattern ^gopher:        1440    0%      1440
refresh_pattern -i (/cgi-bin/|\?) 0     0%      0
refresh_pattern .               0       20%     4320

Without caching you can remove these *entirely*.

acl port0 myport 30000

Mumble. Less reliable than myportname, but it is infinitesimally faster
when it does work at all.

http_access allow testhost
tcp_outgoing_address x.x.x.x port0

include is there for basic ACL - safe ports and so on - to minimize
config file footprint since it's static and same for every worker.

and so on 44 more times in this config file

Only put allow testhost once. Every time you test ACLs Squid slows down.

Some ACLs are worse drag than others. You can probably optimize even the
default recommended security settings you shuffled into "rules" file to
operate better.
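
Since the per-instance lines in the quoted config differ only by instance number, they could be generated rather than edited by hand for each instance. A minimal sketch, assuming the paths from the quoted config, the daemon log module suggested above, and a core number passed in explicitly; the function name instance_conf is made up for illustration:

```shell
#!/bin/sh
# instance_conf N CORE   (hypothetical helper)
# Prints the per-instance directives for instance N, pinned to Squid core
# number CORE. Everything shared stays in the included rules file.
instance_conf() {
    n=$1
    core=$2
    cat <<EOF
visible_hostname squid${n}
pid_filename /var/run/squid${n}.pid
cpu_affinity_map process_numbers=1 cores=${core}
access_log daemon:/var/log/squid/squid${n}-access.log squid
cache_log /var/log/squid/squid${n}-cache.log
coredump_dir /var/spool/squid/squid${n}
include /etc/squid/rules
EOF
}
```

For example, "instance_conf 3 5 > squid3.conf" would write the fragment for the third instance on core 5. The output file name is an assumption too; use whatever layout your init scripts expect.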

Do you know of any good article on how to tune kernel locking or have any
idea why it is happening?
I cannot find any good info on it and all I've found are bits and pieces
of kernel source code.

Sorry no. All I found was the same.

Though I do know that one of the big differences between Linux 2.6 and
3.0 was the removal of the "Big Kernel Lock" system that allowed Linux
to run on multi-core systems properly. It could be CentOS 6 itself biting
you with its ancient kernel version.

Amos
_______________________________________________
squid-users mailing list
squid-users@xxxxxxxxxxxxxxxxxxxxx
http://lists.squid-cache.org/listinfo/squid-users
