Re: Fwd: Re: cpg latency

On 21/05/12 15:34 +0400, Voznesensky Vladimir wrote:
Hello.

Please find the comments below...

On 19.05.2012 08:15, Steven Dake wrote:
Angus,

It might make more sense to store the used and free space in the header
to avoid their 11% cpu utilization.  The routines are a bit complex and
take a lot of clock cycles with the math + comparisons + jumping.  If
stored in the header, each process is responsible for maintaining its
values (write responsible for setting used, read responsible for setting
free) which would convert those to a simple addition and subtraction for
each message.
It seems that we do not need separate code and variables
for calculating free space and used space. These are the places
where CPU cache collisions occur.

Suppose our eventfd is created without the EFD_SEMAPHORE flag,
so every read flushes the counter to zero.

Suppose the size of a message does not exceed the buffer
size - 1, to prevent unnecessary memory copying.

An eventfd is created by the writing process and sent to the
reading process through the unix socket.
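
For reference, that handoff is the usual SCM_RIGHTS dance over a unix
socket.  A minimal sketch, assuming an already connected AF_UNIX socket
'sock' (the function name is made up here; this is not libqb code):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Writer side: create the eventfd (no EFD_SEMAPHORE, so a single read
 * flushes the counter to zero) and pass the descriptor to the reader
 * over the connected unix socket via SCM_RIGHTS. */
static int send_eventfd(int sock)
{
    int efd = eventfd(0, 0);
    if (efd < 0) {
        return -1;
    }

    char dummy = 'E';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };

    union {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.buf,
        .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &efd, sizeof(int));

    if (sendmsg(sock, &msg, 0) < 0) {
        return -1;
    }
    return efd;
}

The reader does the matching recvmsg() and registers the received
descriptor with its poll loop.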

Every message is written with its size in the starting 4 bytes,
so every normal message is at least 4 bytes in size.

So, every message write to the ring buffer consists of these steps (a C sketch follows the list):

1. If buffer_size - ((write_pointer - read_pointer) % buffer_size) <
  message_size + 1,
    * set errno = EAGAIN and return;
2. If write_pointer + message_size > buf_pointer + buf_size
    * write a special message to mark the wrap to the start of the buffer,
    * set write_pointer = buf_pointer,
    * do the 1st step again and goto 3.
3. Write the message starting from the write pointer;
4. Write 1 to the eventfd;
5. write_pointer += message_size.
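
Rendered literally in C, those write steps might look like the sketch
below.  The struct and names are hypothetical, not libqb's; real
cross-process code would also need atomics or barriers around the
shared offsets, and, like the pseudocode, it assumes the 4-byte wrap
marker always fits before the end of the buffer:

#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical shared-memory layout, for illustration only: every
 * message is a 4-byte length header followed by the payload, and a
 * zero length is the special "wrap to the start" marker. */
struct ring {
    char     *buf;        /* start of the shared data area         */
    uint32_t  buf_size;   /* size of the data area in bytes        */
    uint32_t  write_off;  /* byte offset, maintained by the writer */
    uint32_t  read_off;   /* byte offset, maintained by the reader */
    int       efd;        /* eventfd shared with the reader        */
};

static uint32_t ring_used(const struct ring *r)
{
    return (r->write_off >= r->read_off)
        ? r->write_off - r->read_off
        : r->buf_size - (r->read_off - r->write_off);
}

static int ring_write(struct ring *r, const void *msg, uint32_t len)
{
    uint32_t total = sizeof(uint32_t) + len;

    /* Step 1: enough free space?  One byte always stays unused so a
     * full ring never looks identical to an empty one. */
    if (r->buf_size - ring_used(r) < total + 1) {
        errno = EAGAIN;
        return -1;
    }

    /* Step 2: would the message run past the end of the buffer?
     * Drop the zero-length wrap marker, restart at offset 0 and
     * recheck the free space. */
    if (r->write_off + total > r->buf_size) {
        uint32_t wrap = 0;
        memcpy(r->buf + r->write_off, &wrap, sizeof(wrap));
        r->write_off = 0;
        if (r->buf_size - ring_used(r) < total + 1) {
            errno = EAGAIN;
            return -1;
        }
    }

    /* Step 3: length header, then payload. */
    memcpy(r->buf + r->write_off, &len, sizeof(len));
    memcpy(r->buf + r->write_off + sizeof(len), msg, len);

    /* Steps 4-5: publish the new write offset, then kick the reader
     * through the eventfd.  (The pseudocode signals first and advances
     * afterwards; advancing first keeps the reader's step 7 comparison
     * below from running ahead of the data.) */
    r->write_off += total;

    uint64_t one = 1;
    if (write(r->efd, &one, sizeof(one)) != sizeof(one)) {
        return -1;
    }
    return 0;
}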

Every read consists of these steps (again, a sketch follows the list):

1. Poll the file descriptors including the eventfd in the main loop, so
  the eventfd becomes readable;
2. Read the value from the eventfd.
3. Read the message from the buffer;
4. If the message is special,
    * set read_pointer = buffer_pointer,
    * do the 3rd step again;
5. Process the message;
6. read_pointer += message_size;
7. If the read pointer is not equal to the write pointer,
    * goto 3.
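
And the matching read side, reusing the hypothetical struct ring from
the write sketch (illustration only, same caveats as above; process()
stands in for whatever the main loop does with a message):

static void ring_read_all(struct ring *r,
                          void (*process)(const void *msg, uint32_t len))
{
    /* Steps 1-2: the main loop's poll() reported the eventfd readable;
     * a single read resets its counter to zero (no EFD_SEMAPHORE). */
    uint64_t count;
    if (read(r->efd, &count, sizeof(count)) != (ssize_t) sizeof(count)) {
        return;   /* nothing pending */
    }

    /* Steps 3-7: drain every complete message up to the write offset. */
    while (r->read_off != r->write_off) {
        uint32_t len;
        memcpy(&len, r->buf + r->read_off, sizeof(len));

        /* Step 4: the zero-length special message means "wrap". */
        if (len == 0) {
            r->read_off = 0;
            continue;
        }

        /* Step 5: hand the payload to the caller. */
        process(r->buf + r->read_off + sizeof(len), len);

        /* Step 6: advance past the header and the payload. */
        r->read_off += sizeof(len) + len;
    }
}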

Am I correct here? Is my pseudocode optimal?

Pretty much spot on:)

check out the code on github.

-Angus

The conclusion about these points is that sem_wait enters the kernel via
a syscall (futex) more often because it is often waiting on new data from
the writer.

When the writer floods the ring buffer faster than the reader can
process the messages, the user space semaphore doesn't need to context
switch (sem_wait returns immediately).  This results in 550MB/sec for the
small message sizes (vs 100MB/sec when always hitting the context switch).

This would also explain the more clients = less latency pattern seen,
since more clients will cause the server's ring buffers to back up.
The message flow is not so high in this case...
Vladimir,

I am pretty sure there is no way to implement a process shared ring
buffer without a semaphore (or some other dec&test mechanism) that
doesn't result in excessive cpu utilization (a naive approach would be
for the reader to spin on reading).
Sure, but we could use an eventfd, which seems to be the fastest
file type in the Linux kernel.
In the case of the other program you
spoke about, they batch up 1000 messages.  This allows the reader to
read all 1000 messages without having to context switch into the kernel
at inopportune times.  Essentially their ring buffer backs up slightly.

As an example, consider the following (in the libqb context):
1) writer writes a message into the ring buffer
2) writer sem_posts
3) writer writes a message into the ring buffer
4) reader sitting in futex (sem_wait) awakens
5) reader reads message
6) reader sem_waits, but because the sem_post for step 3's message hasn't
happened yet, it enters a kernel wait queue
7) writer sem_posts
8) reader awakens and processes message

In a scenario where the messages are batched, step 6 doesn't result in
a context switch, but instead in immediate processing of the message on
multiprocessors.
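
For comparison, the semaphore variant under discussion amounts to a
process-shared POSIX semaphore living in the mapped segment, roughly as
in the sketch below (names are hypothetical; glibc's sem_wait()
decrements the count in user space and only falls back to a futex()
syscall when the count is already zero, which is the wait in step 6):

#include <semaphore.h>

/* Hypothetical control block living in the same shared mapping as the
 * ring buffer itself. */
struct rb_ctl {
    sem_t posted;              /* one sem_post per message written */
};

/* One-time setup by whichever process creates the mapping;
 * pshared = 1 makes the semaphore usable across processes. */
static int rb_ctl_init(struct rb_ctl *ctl)
{
    return sem_init(&ctl->posted, 1, 0);
}

/* Writer: after copying a message into the ring, make it visible. */
static int rb_notify(struct rb_ctl *ctl)
{
    return sem_post(&ctl->posted);
}

/* Reader: when the ring has backed up, the count is > 0 and sem_wait()
 * returns immediately without entering the kernel; only an empty ring
 * (count == 0) costs the futex wait and the context switch. */
static int rb_wait(struct rb_ctl *ctl)
{
    return sem_wait(&ctl->posted);
}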

It would be interesting to see what the latency is like without the
above example scenario.  I think I got this mostly to happen by simply
adding a printf (to slightly delay rbreader) with the attached patch.
Inter-process magick...

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
