Re: very poor performance of rock cache ipc

Alex Rousskov <rousskov@xxxxxxxxxxxxxxxxxxxxxxx> · Mon, 16 Oct 2023 23:26:50 -0400

On 2023-10-16 16:24, Julian Taylor wrote:
On 15.10.23 05:42, Alex Rousskov wrote:
On 2023-10-14 12:04, Julian Taylor wrote:
On 14.10.23 17:40, Alex Rousskov wrote:
On 2023-10-13 16:01, Julian Taylor wrote:

The reproducer uses a single request, but the very same thing can be 
observed on a very busy squid

If a busy Squid sends lots of IPC messages between worker and disker, 
then either there is a Squid bug we do not know about OR that disker 
is just not as busy as one might expect it to be.

In Squid v6+, you can observe disker queues using mgr:store_queues 
cache manager report. In your environment, do those queues always have 
lots of requests when Squid is busy? Feel free to share (a pointer to) 
a representative sample of those reports from your busy Squid.

N.B. Besides worker-disker IPC messages, there are also worker-worker 
cache synchronization IPC messages. They also have the same "do not 
send IPC messages if the queue has some pending items already" 
optimization.

I checked the queues while running with the configuration from my 
initial mail, with the number of workers increased, and the queues are 
generally short: around 1-10 items in the queue when sending around 100 
parallel requests reading about 100 MB data files. Here is a sample: 
https://dpaste.com/8SLNRW5F8
Also, at this higher request rate (compared with the single curl), 
throughput was more than doubled by increasing the block size.

What are the queues supposed to look like on a busy squid that is not 
spending a large portion of its time doing notify IPC?

The queues are supposed to look "not empty" -- a non-empty queue does 
not result in IPC notifications. Needless to say, the further away from 
"empty" the queues are, the smaller the chance they will become empty 
when the cache manager report is _not_ "looking" at them.
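
To illustrate the "notify only when the queue was empty" rule, here is a 
minimal C++ sketch. It is not Squid's actual IPC queue: the class name 
NotifyingQueue and its methods are hypothetical, and a real shared-memory 
queue would also need synchronization between processes.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical queue, not Squid's actual IPC queue.
// push() reports whether the consumer must be woken up with an IPC
// notification: only when the queue was empty before the push.
template <typename Item>
class NotifyingQueue {
public:
    // Returns true if the caller should send a notification.
    bool push(const Item &item) {
        const bool wasEmpty = items_.empty();
        items_.push_back(item);
        return wasEmpty;
    }

    // Copies the oldest item into `out` and removes it.
    // Returns false (leaving `out` untouched) if the queue is empty.
    bool pop(Item &out) {
        if (items_.empty())
            return false;
        out = items_.front();
        items_.pop_front();
        return true;
    }

    std::size_t size() const { return items_.size(); }

private:
    std::deque<Item> items_;
};
```

If the consumer drains the queue before the next push arrives, every 
push returns true and every request pays for a notification; pushes 
onto a non-empty queue return false and cost no IPC message.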

Increasing the number of parallel requests does decrease the overhead, 
but it is still pretty large: I measured about 10%-30% CPU overhead with 
100 parallel requests served from cache in the worker and disker.
Here is a snippet of a profile:
--22.34%--JobDialer<AsyncJob>::dial(AsyncCall&)
    |
    |--21.19%--Ipc::UdsSender::start()
    |       |
    |        --21.13%--Ipc::UdsSender::write()
    |           |
    |           |--16.12%--Ipc::UdsOp::conn()
    |           |          |
    |           |           --15.84%--comm_open_uds(int, int, sockaddr_un*, int)
    |           |                |--1.70%--commSetCloseOnExec(int)
    |           |                 --1.56%--commSetNonBlocking(int)
   ...
--12.98%--comm_close_complete(int)

Clearing and constructing the large Ipc::TypedMsgHdr is also very 
noticeable.

That the overhead is so high and the maximum throughput so low for not 
so busy squids (say 1-10 requests per second, but with requests 
averaging > 1MiB) is IMO also a reason for concern and could be improved.

I agree.

If I understand the way it works correctly, the worker, when it gets a 
request, splits it into 4k blocks and enqueues read requests into the 
IPC queue; if the queue is empty, it emits a notify IPC so the disker 
starts popping from the queue.

Yes, at some level of abstraction, the above summary is not wrong. 
However, please keep in mind that, for a single HTTP transaction, most 
of the disk read requests are queued by worker, read by disk, and 
received by worker one read request at a time. There is no disk read 
"prefetching" (yet?).

On large requests that are answered immediately from the disker the 
problem seems to be that the queue is mostly empty and it sends an ipc 
ping pong for each 4k block.

Due to lack of prefetching, the total size of the HTTP response does not 
really affect the queue length. Only the transaction concurrency level 
does; on average, that is determined by mean response time multiplied by 
the I/O request rate from a particular worker to a particular disker.
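
That product is an instance of Little's law. As a sketch, using symbols 
that do not appear elsewhere in this thread: let L be the mean number of 
I/O requests outstanding between one worker and one disker, \lambda the 
rate at which that worker sends I/O requests to that disker, and W the 
mean response time. The numbers in the example are purely illustrative, 
not measurements.

```latex
L = \lambda W
% Illustrative only: \lambda = 2000\ \text{requests/s},\ W = 0.5\ \text{ms}
% L = 2000 \times 0.0005 = 1
```

With about one request outstanding on average, the queue is often empty 
when the next request is pushed.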

So my thought was: when the request is larger than 4k, enqueue multiple 
pending reads in the worker and only notify after a certain number have 
been added to the queue, and vice versa in the disker.

So I messed around a bit trying to reduce the notifications by delaying 
the Notify call in src/DiskIO/IpcIo/IpcIoFile.cc for larger requests but 
it ended up blocking after the first queue push with no notify. If I 
understand the queue correctly, this is because the reader requires a 
notify to start initially, and simply pushing multiple read requests 
onto the queue without notifying will not work.

You are correct.

Is this approach feasible or am I misunderstanding how it works?

Prefetching is feasible in principle, but is not easy to implement well 
and will probably require configuration options (because it will slow 
down busy Squids that do not have the time to prefetch but may not know 
that).

I would consider increasing I/O size (and shared memory page size) 
instead, at least as the first step. Doing so well is not trivial 
either, but may be easier and beneficial to more use cases.
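
As a back-of-the-envelope sketch of why a larger I/O size helps: with one 
read request in flight at a time and a mostly empty queue, the number of 
notification round trips is roughly the number of read requests. The 
function name readRequests is hypothetical, and the 32 KiB size used 
below is only an example value, not a recommendation.

```cpp
#include <cstdint>

// Number of read requests needed to fetch objectBytes when each request
// carries at most ioBytes (ceiling division). ioBytes must be non-zero.
std::uint64_t readRequests(std::uint64_t objectBytes, std::uint64_t ioBytes) {
    return (objectBytes + ioBytes - 1) / ioBytes;
}
```

For a 100 MiB object, readRequests(100ull * 1024 * 1024, 4096) yields 
25600 requests, while an I/O size of 32768 bytes yields 3200.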

I also tried to add reuse of the IPC connection between calls, so that 
the major source of overhead, tearing down and reestablishing the 
connection, is removed, but that also turned out to be difficult because 
the connections are closed in various places and because of the general 
complexity of the code.

Yes, that would be nice. Reusing sockets is especially difficult to get 
right with startup/bootstrapping, reconfigurations, and kid 
death/restarts problems in mind. On the other hand, it is probably much 
easier to optimize this than to implement disk hit "prefetching".
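
A minimal sketch of the socket-reuse idea, under stated assumptions: this 
is not Squid code, and ReusableSender with its Opener/Sender/Closer 
callbacks is hypothetical. It keeps one descriptor open across 
notifications and reopens it once if a send fails (for example, after the 
peer kid restarts). It does not address bootstrapping or reconfiguration.

```cpp
#include <functional>
#include <utility>

// Hypothetical helper, not part of Squid.
class ReusableSender {
public:
    using Opener = std::function<int()>;      // returns a descriptor, or -1 on failure
    using Sender = std::function<bool(int)>;  // returns false if the send failed
    using Closer = std::function<void(int)>;  // releases a descriptor

    ReusableSender(Opener open, Sender send, Closer close)
        : open_(std::move(open)), send_(std::move(send)), close_(std::move(close)) {}

    ~ReusableSender() { reset(); }

    ReusableSender(const ReusableSender &) = delete;
    ReusableSender &operator=(const ReusableSender &) = delete;

    // Sends one notification, opening the descriptor only when needed.
    bool notify() {
        for (int attempt = 0; attempt < 2; ++attempt) {
            if (fd_ < 0) {
                fd_ = open_();
                if (fd_ < 0)
                    return false;
                ++opens_;
            }
            if (send_(fd_))
                return true;
            reset(); // the cached descriptor looks stale; retry once
        }
        return false;
    }

    // How many times a descriptor was successfully opened.
    int opens() const { return opens_; }

private:
    void reset() {
        if (fd_ >= 0) {
            close_(fd_);
            fd_ = -1;
        }
    }

    Opener open_;
    Sender send_;
    Closer close_;
    int fd_ = -1;
    int opens_ = 0;
};
```

In a real process the three callbacks would wrap the platform's socket 
open, send, and close calls; they are injected here so that the reuse 
logic can be exercised without real sockets.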

There may be some other, more efficient IPC notification mechanisms 
available on your OS that Squid can be enhanced to support. I have not 
surveyed what is available these days.

HTH,

Alex.

_______________________________________________
squid-users mailing list
squid-users@xxxxxxxxxxxxxxxxxxxxx
https://lists.squid-cache.org/listinfo/squid-users