On 2023-10-14 12:04, Julian Taylor wrote:
> On 14.10.23 17:40, Alex Rousskov wrote:
>> On 2023-10-13 16:01, Julian Taylor wrote:
>>> When using Squid for caching with the rock cache_dir setting, the
>>> performance is pretty poor with multiple workers.
>>> The reason for this is the very high number of system calls
>>> involved in the IPC between the disker and the workers.
>> Please allow me to rephrase your conclusion to better match (expected)
>> reality and avoid misunderstanding:
>>
>> By design, a mostly idle SMP Squid should use a lot more system calls
>> per disk cache hit than a busy SMP Squid would:
>>
>> * Mostly idle Squid: Every disk I/O may require a few IPC messages.
>> * Busy Squid: Bugs notwithstanding, disk I/Os require no IPC messages.
>>
>> In your single-request test, you are observing the expected effects
>> described in the first bullet. That does not imply those effects are
>> "good" or "desirable" in your use case, of course. It only means that
>> SMP Squid was not optimized for that use case; the SMP rock design was
>> explicitly targeting the opposite use case (i.e. a busy Squid).
> The reproducer uses a single request, but the very same thing can be
> observed on a very busy Squid
If a busy Squid sends lots of IPC messages between worker and disker,
then either there is a Squid bug we do not know about OR that disker is
just not as busy as one might expect it to be.
In Squid v6+, you can observe disker queues using the mgr:store_queues cache
manager report. In your environment, do those queues always have lots of
requests when Squid is busy? Feel free to share (a pointer to) a
representative sample of those reports from your busy Squid.
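
For reference, one common way to pull that report, assuming squidclient is
installed and the cache manager is reachable through the local proxy port, is
something like:

    squidclient mgr:store_queues

Any HTTP client pointed at the proxy's /squid-internal-mgr/store_queues path
should return the same report.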
N.B. Besides worker-disker IPC messages, there are also worker-worker
cache synchronization IPC messages. They also have the same "do not send
IPC messages if the queue has some pending items already" optimization.
> and the workaround improves both the single-request case and the actual
> heavily loaded production Squid in the same way.
FWIW, I do not think that observation contradicts anything I have said.
> The hardware involved has a 10G card and no SSDs, but lots of RAM, so it
> has a very high page cache hit rate. The Squid is very busy, so much so
> that it is overloaded by system CPU usage in the default configuration
> with the rock cache. Network and disk bandwidth are barely ever utilized
> beyond 10%, with all 8 CPUs busy on system load.
The above facts suggest that the disk is just not used much OR there is
a bug somewhere. Slower (for any reason, including CPU overload) IPC
messages should lead to longer queues and the disappearance of "your
queue is no longer empty!" IPC messages.
> The only way to get Squid to utilize the machine is to increase the I/O
> size via the request buffer change, or to not use the rock cache. The UFS
> cache works OK in comparison, but requires multiple independent Squid
> instances as it does not support SMP.
>
> Increasing the I/O size to 32 KiB as I mentioned does allow the Squid
> workers to utilize a good 60% of the hardware's network and disk
> capabilities.
Please note that I am not disputing this observation. Unfortunately, it
does not help me guess where the actual/core problem or bottleneck is.
Hopefully, the cache manager mgr:store_queues report will shed some light.
>> Roughly speaking, here, "busy" means "there are always some messages
>> in the disk I/O queue [maintained by Squid in shared memory]".
>>
>> You may wonder how it is possible that an increase in I/O work results
>> in a decrease (and, hopefully, elimination) of related IPC messages.
>> Roughly speaking, a worker must send an IPC "you have a new I/O
>> request" message only when its worker->disker queue is empty. If the
>> queue is not empty, then there is no reason to send an IPC message to
>> wake up the disker, because the disker will see the new message when
>> dequeuing the previous one. The same applies in the opposite direction:
>> disker->worker...
> This is probably true if you have slow disks and are actually I/O bound,
> but with fast disks or a high page cache hit rate you essentially see this
> IPC ping-pong and very little actual work being done.
AFAICT, "too slow" IPC messages should result in non-empty queues and,
hence, no IPC messages at all. For this logic to work, it does not
matter whether the system is I/O bound or not, whether disks are "slow"
or not.
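
To make the "send an IPC message only when the queue was empty" rule above
concrete, here is a minimal sketch of the idea. It is not Squid's actual code:
the real queues live in lock-free shared memory, while this stand-in uses a
mutex-protected deque and a made-up sendIpcWakeup() so the edge-triggered
notification is easy to see:

    // Illustrative sketch only, not Squid source code.
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <string>

    struct DiskRequest { std::string path; };

    class NotifyingQueue {
    public:
        // Worker side: enqueue a request; send the "wake up" IPC message
        // only on the empty -> non-empty transition.
        void push(const DiskRequest &r) {
            std::lock_guard<std::mutex> lock(mtx);
            const bool wasEmpty = q.empty();
            q.push_back(r);
            if (wasEmpty)
                sendIpcWakeup(); // one notification per burst of requests
            // otherwise: the disker is already draining (or about to) and
            // will see this request without any extra system calls
        }

        // Disker side: after one wakeup, drain everything that accumulated.
        void drainAll() {
            for (;;) {
                DiskRequest r;
                {
                    std::lock_guard<std::mutex> lock(mtx);
                    if (q.empty())
                        return;
                    r = q.front();
                    q.pop_front();
                }
                std::cout << "disker reads " << r.path << "\n";
            }
        }

    private:
        void sendIpcWakeup() { std::cout << "IPC: queue is no longer empty\n"; }

        std::mutex mtx;
        std::deque<DiskRequest> q;
    };

    int main() {
        NotifyingQueue queue;
        queue.push({"/cache/rock-slot-1"}); // triggers one IPC wakeup
        queue.push({"/cache/rock-slot-2"}); // no IPC: queue already non-empty
        queue.push({"/cache/rock-slot-3"}); // no IPC
        queue.drainAll();                   // the disker handles all three
    }

A mostly idle Squid never gets past the empty -> non-empty edge, so nearly
every request pays for a notification; a busy Squid rarely hits that edge,
which is the behavior described above.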
>>> Is it necessary to have these read chunks so small?
>> It is not. Disk I/O size should be at least the system I/O page size,
>> but it can be larger. The optimal I/O size is probably very dependent
>> on traffic patterns. IIRC, Squid I/O size is at most one Squid page
>> (SM_PAGE_SIZE or 4KB).
>>
>> FWIW, I suspect there are significant inefficiencies in disk I/O-related
>> request alignment: the code does not attempt to read from and write to
>> disk page boundaries, probably resulting in multiple low-level disk I/Os
>> per one Squid 4KB I/O in some (many?) cases. With modern non-rotational
>> storage these effects are probably less pronounced, but they probably
>> still exist.
> The kernel drivers will mostly handle this for you if multiple requests
> are available, but this is also almost irrelevant with current hardware:
> typically it will be so fast that software overhead makes it hard to
> utilize modern large disk arrays properly;
I doubt that doing twice as many low-level disk I/Os (due to wrong alignment)
is irrelevant, but we do not need to agree on that to make progress: clearly,
excessive low-level disk I/Os are not the bottleneck in your current
environment.
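
To illustrate the alignment point with made-up numbers (this is not Squid
code): a 4 KB read that starts in the middle of a device page straddles two
underlying pages, while the same read placed on a page boundary touches only
one:

    // Illustration only: pages touched by aligned vs. unaligned 4 KB reads.
    #include <cstdint>
    #include <cstdio>

    static uint64_t pagesTouched(uint64_t offset, uint64_t length,
                                 uint64_t pageSize) {
        const uint64_t firstPage = offset / pageSize;
        const uint64_t lastPage = (offset + length - 1) / pageSize;
        return lastPage - firstPage + 1;
    }

    int main() {
        const uint64_t pageSize = 4096; // assumed device page size
        const uint64_t ioSize = 4096;   // one Squid page (SM_PAGE_SIZE)

        // same 4 KB payload, two possible on-disk placements
        std::printf("read at offset 6144 touches %llu page(s)\n",
                    (unsigned long long)pagesTouched(6144, ioSize, pageSize)); // 2
        std::printf("read at offset 4096 touches %llu page(s)\n",
                    (unsigned long long)pagesTouched(4096, ioSize, pageSize)); // 1
    }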
> you probably need to look at
> other approaches like io_uring to get rid of the classical read/write
> system call overhead dominating your performance.
Yes, but those things are complementary (i.e. not mutually exclusive).
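
For readers who have not used io_uring, here is a minimal, hedged sketch of
the submission/completion model it offers in place of blocking read() calls.
It assumes liburing is available (build with -luring) and reads from a made-up
test file; it is not a proposal for how Squid's disk I/O code should look:

    // Minimal io_uring read sketch (assumes liburing; not Squid code).
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        struct io_uring ring;
        if (io_uring_queue_init(32, &ring, 0) < 0)  // 32-entry queue pair
            return 1;

        const int fd = open("/tmp/rock-test.dat", O_RDONLY); // made-up test file
        if (fd < 0)
            return 1;

        static char buf[32 * 1024];                 // e.g. one 32 KiB read

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); // read at offset 0
        io_uring_submit(&ring);                     // one syscall submits the batch

        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {  // reap the completion
            std::printf("read returned %d bytes\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }

Multiple reads and writes can be queued before a single io_uring_submit()
call, which is where the per-request system call savings come from.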
Cheers,
Alex.