Hello. Please find my comments below...

On 19.05.2012 08:15, Steven Dake wrote:
> Angus,
>
> It might make more sense to store the used and free space in the header
> to avoid their 11% cpu utilization. The routines are a bit complex and
> take a lot of clock cycles with the math + comparisons + jumping. If
> stored in the header, each process is responsible for maintaining its
> values (write responsible for setting used, read responsible for setting
> free) which would convert those to a simple addition and subtraction for
> each message.

It seems that we do not need separate code and variables for the free space and used space calculations. Those are the places where CPU cache collisions happen.

Let's assume our eventfd was created without the EFD_SEMAPHORE flag, so every read resets the counter to zero. Let's also assume the size of a message does not exceed the buffer size - 1, to prevent unnecessary memory copying. An eventfd is created in the writing process and sent to the reading process through the unix socket. Every message is written with its size in the first 4 bytes, so every normal message is at least 4 bytes in size. So every message write to the ring buffer consists of these steps:
> The conclusion about these points is that sem_wait enters the kernel via
> syscall (futex) more often because it is often waiting on new data from
> the writer. When the writer is flooding the ring buffer faster than the
> reader can process the messages, the user space semaphore doesn't need
> to context switch (sem_wait returns immediately). This results in
> 550mb/sec for the small message sizes (vs 100mb/sec when always hitting
> the context switch). This would also explain the "more clients = less
> latency" pattern seen, since more clients will cause the server's ring
> buffers to back up.

The message flow is not so high in this case...

> Vladimir,
>
> I am pretty sure there is no way to implement a process shared ring
> buffer without a semaphore (or some other dec&test mechanism) that
> doesn't result in excessive cpu utilization (a naive approach would be
> for the reader to spin on reading).

Sure, but we could use eventfd, which seems to be the fastest file type in the Linux kernel.

> In the case of the other program you spoke about, they batch up 1000
> messages. This allows the reader to read all 1000 messages without
> having to context switch into the kernel at inopportune times.
> Essentially their ring buffer backs up a slight bit.
>
> As an example, consider the following (in the libqb context):
>
> 1) writer writes a message into the ring buffer
> 2) writer sem_posts
> 3) writer writes a message into the ring buffer
> 4) reader sitting in futex (sem_wait) awakens
> 5) reader reads message
> 6) reader sem_waits, but because step 3 didn't finish its sem_post, it
>    enters a kernel wait queue
> 7) writer sem_posts
> 8) reader awakens and processes message
>
> In a scenario where there are batched messages, step 6 doesn't result in
> a context switch, but instead an immediate processing of the message on
> multiprocessors.
>
> It would be interesting to see what the latency is like without the
> above example scenario.

Inter-process magick...
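For reference, a minimal sketch of the eventfd counter behaviour this relies on, assuming Linux and an eventfd created without EFD_SEMAPHORE. The helper names notify_messages and drain_notifications are hypothetical and are not part of libqb, and passing the descriptor over the unix socket is left out. It only shows that several writer notifications collapse into one reader wakeup:

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Writer side (hypothetical helper): add n to the eventfd counter,
 * for example one unit per message placed in the ring buffer.
 * Returns 0 on success, -1 on failure. */
int notify_messages(int efd, uint64_t n)
{
    assert(n > 0);
    return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Reader side (hypothetical helper): without EFD_SEMAPHORE a single
 * read returns the whole accumulated count and resets it to zero.
 * Returns the count, or 0 when nothing is pending; this assumes the
 * descriptor was opened with EFD_NONBLOCK so an empty read fails
 * with EAGAIN instead of blocking. */
uint64_t drain_notifications(int efd)
{
    uint64_t count = 0;

    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

With a descriptor from eventfd(0, EFD_NONBLOCK), three calls to notify_messages(efd, 1) followed by one drain_notifications(efd) give 3, and a second drain gives 0. The reader learns about the whole batch with one syscall.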
I think I got this mostly to happen by simply adding a printf (to slightly delay rbreader) with the attached patch.
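Coming back to the header counters quoted at the top: below is a rough sketch of one way to read that suggestion, not the libqb code. It assumes a single writer and a single reader, C11 atomics, and a made-up 64-byte capacity. The names struct rb, rb_write, rb_read and rb_used are hypothetical. Each side stores only its own running total, so used and free space become one subtraction, and every message carries its size in the first 4 bytes:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RB_SIZE 64u /* hypothetical capacity in bytes; must be a power of two */
static_assert((RB_SIZE & (RB_SIZE - 1u)) == 0, "RB_SIZE must be a power of two");

struct rb {
    _Atomic uint32_t written;  /* total bytes written; stored only by the writer */
    _Atomic uint32_t consumed; /* total bytes read; stored only by the reader */
    uint8_t data[RB_SIZE];
};

void rb_init(struct rb *rb)
{
    atomic_init(&rb->written, 0);
    atomic_init(&rb->consumed, 0);
}

/* Used space is a single subtraction; free space is RB_SIZE minus this. */
uint32_t rb_used(struct rb *rb)
{
    return atomic_load(&rb->written) - atomic_load(&rb->consumed);
}

/* Byte-wise copies that wrap around the end of the data area. */
static void copy_in(struct rb *rb, uint32_t pos, const void *src, uint32_t len)
{
    const uint8_t *s = src;
    for (uint32_t i = 0; i < len; i++)
        rb->data[(pos + i) % RB_SIZE] = s[i];
}

static void copy_out(struct rb *rb, uint32_t pos, void *dst, uint32_t len)
{
    uint8_t *d = dst;
    for (uint32_t i = 0; i < len; i++)
        d[i] = rb->data[(pos + i) % RB_SIZE];
}

/* Writer: store a 4-byte size prefix and the payload, then publish both
 * with one addition. Returns 0, or -1 when there is not enough free space. */
int rb_write(struct rb *rb, const void *msg, uint32_t len)
{
    uint32_t w = atomic_load_explicit(&rb->written, memory_order_relaxed);
    uint32_t r = atomic_load_explicit(&rb->consumed, memory_order_acquire);
    uint32_t free_bytes = RB_SIZE - (w - r);

    if (len > RB_SIZE - 4u || len + 4u > free_bytes)
        return -1;
    copy_in(rb, w, &len, 4u);
    copy_in(rb, w + 4u, msg, len);
    atomic_store_explicit(&rb->written, w + 4u + len, memory_order_release);
    return 0;
}

/* Reader: fetch one message and release its space with one addition.
 * Returns the payload length, or -1 when the buffer is empty or the
 * caller's buffer (cap bytes) is too small. */
int rb_read(struct rb *rb, void *out, uint32_t cap)
{
    uint32_t r = atomic_load_explicit(&rb->consumed, memory_order_relaxed);
    uint32_t w = atomic_load_explicit(&rb->written, memory_order_acquire);
    uint32_t len;

    if (w - r < 4u)
        return -1;
    copy_out(rb, r, &len, 4u);
    if (len > cap)
        return -1;
    copy_out(rb, r + 4u, out, len);
    atomic_store_explicit(&rb->consumed, r + 4u + len, memory_order_release);
    return (int)len;
}
```

The two totals are only ever added to, each by its own side, so neither process recomputes anything from read and write pointers. In a real shared-memory layout the two counters would probably want to sit on separate cache lines, which this sketch does not attempt.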
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss