On 2023-10-14 12:04, Julian Taylor wrote:
> On 14.10.23 17:40, Alex Rousskov wrote:
>> On 2023-10-13 16:01, Julian Taylor wrote:
>>> When using Squid for caching with the rock cache_dir setting, the
>>> performance is pretty poor with multiple workers.
>>> The reason for this is the very high number of system calls
>>> involved in the IPC between the disker and the workers.
>> Please allow me to rephrase your conclusion to better match (expected)
>> reality and avoid misunderstanding:
>>
>> By design, a mostly idle SMP Squid should use a lot more system calls
>> per disk cache hit than a busy SMP Squid would:
>>
>> * Mostly idle Squid: Every disk I/O may require a few IPC messages.
>> * Busy Squid: Bugs notwithstanding, disk I/Os require no IPC messages.
>>
>> In your single-request test, you are observing the expected effects
>> described in the first bullet. That does not imply those effects are
>> "good" or "desirable" in your use case, of course. It only means that
>> SMP Squid was not optimized for that use case; the SMP rock design was
>> explicitly targeting the opposite use case (i.e. a busy Squid).
> The reproducer uses a single request, but the very same thing can be
> observed on a very busy Squid
If a busy Squid sends lots of IPC messages between worker and disker,
then either there is a Squid bug we do not know about OR that disker is
just not as busy as one might expect it to be.
In Squid v6+, you can observe disker queues using the mgr:store_queues cache
manager report. In your environment, do those queues always have lots of
requests when Squid is busy? Feel free to share (a pointer to) a
representative sample of those reports from your busy Squid.
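
For reference, one common way to pull that report, assuming squidclient is
installed and the cache manager is reachable through the local proxy port, is
something like:

    squidclient mgr:store_queues

Any HTTP client pointed at the proxy's /squid-internal-mgr/store_queues path
should return the same report.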
N.B. Besides worker-disker IPC messages, there are also worker-worker
cache synchronization IPC messages. They also have the same "do not send
IPC messages if the queue has some pending items already" optimization.
> and the workaround improves both the single-request case and the actual
> heavily loaded production Squid in the same way.
FWIW, I do not think that observation contradicts anything I have said.
> The hardware involved has a 10G card and no SSDs, but lots of RAM, so it
> has a very high page cache hit rate. The Squid is very busy, so much so
> that it is overloaded by system CPU usage in the default configuration
> with the rock cache. Network and disk bandwidth are barely ever utilized
> beyond 10%, with all 8 CPUs busy on system load.
The above facts suggest that the disk is just not used much OR there is
a bug somewhere. Slower (for any reason, including CPU overload) IPC
messages should lead to longer queues and the disappearance of "your
queue is no longer empty!" IPC messages.
> The only way to get Squid to utilize the machine is to increase the I/O
> size via the request buffer change, or to not use the rock cache. The UFS
> cache works OK in comparison, but requires multiple independent Squid
> instances as it does not support SMP.
>
> Increasing the I/O size to 32 KiB as I mentioned does allow the Squid
> workers to utilize a good 60% of the hardware's network and disk
> capabilities.
Please note that I am not disputing this observation. Unfortunately, it
does not help me guess where the actual/core problem or bottleneck is.
Hopefully, the cache manager mgr:store_queues report will shed some light.
>> Roughly speaking, here, "busy" means "there are always some messages
>> in the disk I/O queue [maintained by Squid in shared memory]".
>>
>> You may wonder how it is possible that an increase in I/O work results
>> in a decrease (and, hopefully, elimination) of related IPC messages.
>> Roughly speaking, a worker must send an IPC "you have a new I/O
>> request" message only when its worker->disker queue is empty. If the
>> queue is not empty, then there is no reason to send an IPC message to
>> wake up the disker, because the disker will see the new message when
>> dequeuing the previous one. The same applies in the opposite direction:
>> disker->worker...
> This is probably true if you have slow disks and are actually I/O bound,
> but with fast disks or a high page cache hit rate you essentially see this
> IPC ping-pong and very little actual work being done.
AFAICT, "too slow" IPC messages should result in non-empty queues and,
hence, no IPC messages at all. For this logic to work, it does not
matter whether the system is I/O bound or not, whether disks are "slow"
or not.
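
To make the "send an IPC message only when the queue was empty" rule above
concrete, here is a minimal sketch of the idea. It is not Squid's actual code:
the real queues live in lock-free shared memory, while this stand-in uses a
mutex-protected deque and a made-up sendIpcWakeup() so the edge-triggered
notification is easy to see:

    // Illustrative sketch only, not Squid source code.
    #include <deque>
    #include <iostream>
    #include <mutex>
    #include <string>

    struct DiskRequest { std::string path; };

    class NotifyingQueue {
    public:
        // Worker side: enqueue a request; send the "wake up" IPC message
        // only on the empty -> non-empty transition.
        void push(const DiskRequest &r) {
            std::lock_guard<std::mutex> lock(mtx);
            const bool wasEmpty = q.empty();
            q.push_back(r);
            if (wasEmpty)
                sendIpcWakeup(); // one notification per burst of requests
            // otherwise: the disker is already draining (or about to) and
            // will see this request without any extra system calls
        }

        // Disker side: after one wakeup, drain everything that accumulated.
        void drainAll() {
            for (;;) {
                DiskRequest r;
                {
                    std::lock_guard<std::mutex> lock(mtx);
                    if (q.empty())
                        return;
                    r = q.front();
                    q.pop_front();
                }
                std::cout << "disker reads " << r.path << "\n";
            }
        }

    private:
        void sendIpcWakeup() { std::cout << "IPC: queue is no longer empty\n"; }

        std::mutex mtx;
        std::deque<DiskRequest> q;
    };

    int main() {
        NotifyingQueue queue;
        queue.push({"/cache/rock-slot-1"}); // triggers one IPC wakeup
        queue.push({"/cache/rock-slot-2"}); // no IPC: queue already non-empty
        queue.push({"/cache/rock-slot-3"}); // no IPC
        queue.drainAll();                   // the disker handles all three
    }

A mostly idle Squid never gets past the empty -> non-empty edge, so nearly
every request pays for a notification; a busy Squid rarely hits that edge,
which is the behavior described above.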
>>> Is it necessary to have these read chunks so small?
>> It is not. Disk I/O size should be at least the system I/O page size,
>> but it can be larger. The optimal I/O size is probably very dependent
>> on traffic patterns. IIRC, Squid I/O size is at most one Squid page
>> (SM_PAGE_SIZE or 4KB).
>>
>> FWIW, I suspect there are significant inefficiencies in disk I/O-related
>> request alignment: the code does not attempt to read from and write to
>> disk page boundaries, probably resulting in multiple low-level disk I/Os
>> per one Squid 4KB I/O in some (many?) cases. With modern non-rotational
>> storage these effects are probably less pronounced, but they probably
>> still exist.
> The kernel drivers will mostly handle this for you if multiple requests
> are available, but this is also almost irrelevant with current hardware:
> typically it will be so fast that software overhead makes it hard to
> utilize modern large disk arrays properly;
I doubt that doing twice as many low-level disk I/Os (due to wrong alignment)
is irrelevant, but we do not need to agree on that to make progress: clearly,
excessive low-level disk I/Os are not the bottleneck in your current
environment.
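
To illustrate the alignment point with made-up numbers (this is not Squid
code): a 4 KB read that starts in the middle of a device page straddles two
underlying pages, while the same read placed on a page boundary touches only
one:

    // Illustration only: pages touched by aligned vs. unaligned 4 KB reads.
    #include <cstdint>
    #include <cstdio>

    static uint64_t pagesTouched(uint64_t offset, uint64_t length,
                                 uint64_t pageSize) {
        const uint64_t firstPage = offset / pageSize;
        const uint64_t lastPage = (offset + length - 1) / pageSize;
        return lastPage - firstPage + 1;
    }

    int main() {
        const uint64_t pageSize = 4096; // assumed device page size
        const uint64_t ioSize = 4096;   // one Squid page (SM_PAGE_SIZE)

        // same 4 KB payload, two possible on-disk placements
        std::printf("read at offset 6144 touches %llu page(s)\n",
                    (unsigned long long)pagesTouched(6144, ioSize, pageSize)); // 2
        std::printf("read at offset 4096 touches %llu page(s)\n",
                    (unsigned long long)pagesTouched(4096, ioSize, pageSize)); // 1
    }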
> you probably need to look at
> other approaches like io_uring to get rid of the classical read/write
> system call overhead dominating your performance.
Yes, but those things are complementary (i.e. not mutually exclusive).
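
For readers who have not used io_uring, here is a minimal, hedged sketch of
the submission/completion model it offers in place of blocking read() calls.
It assumes liburing is available (build with -luring) and reads from a made-up
test file; it is not a proposal for how Squid's disk I/O code should look:

    // Minimal io_uring read sketch (assumes liburing; not Squid code).
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        struct io_uring ring;
        if (io_uring_queue_init(32, &ring, 0) < 0)  // 32-entry queue pair
            return 1;

        const int fd = open("/tmp/rock-test.dat", O_RDONLY); // made-up test file
        if (fd < 0)
            return 1;

        static char buf[32 * 1024];                 // e.g. one 32 KiB read

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); // read at offset 0
        io_uring_submit(&ring);                     // one syscall submits the batch

        struct io_uring_cqe *cqe;
        if (io_uring_wait_cqe(&ring, &cqe) == 0) {  // reap the completion
            std::printf("read returned %d bytes\n", cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        close(fd);
        io_uring_queue_exit(&ring);
        return 0;
    }

Multiple reads and writes can be queued before a single io_uring_submit()
call, which is where the per-request system call savings come from.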
Cheers,
Alex.