Re: Fwd: Re: cpg latency

On 18/05/12 21:15 -0700, Steven Dake wrote:
On 05/18/2012 01:42 PM, Steven Dake wrote:

Taking back to list...

Odd that more cpg clients reduce latency...

Assuming the instrumentation is correct (everything I read says don't
trust the TSC), the results are:

rbreader->rbwriter = 10-20 usec
corosync 1 client without totem = 70-120 usec
corosync 1 client with totem = 150-160 usec
corosync 4 client without totem = 40-50 usec
corosync 4 client with totem = 50-60 usec

This implies corosync adds 40usec of latency on top of the ringbuffer
implementation.

Running rbwriter 1000 times in a loop with a rbreader running shows the
following oprofile results:

rb-reader:

2282834  62.0242  libc-2.14.90.so          __memcpy_ssse3_back
495055   13.4506  libqb.so.0.13.0          qb_rb_space_used
378187   10.2753  libpthread-2.14.90.so    sem_timedwait
207293    5.6321  libqb.so.0.13.0          qb_rb_chunk_check
82163     2.2324  libqb.so.0.13.0          qb_rb_chunk_read
78788     2.1407  libqb.so.0.13.0          qb_rb_chunk_reclaim
42200     1.1466  libqb.so.0.13.0          my_posix_sem_timedwait
35622     0.9678  [vdso] (tgid:20388 range:0x7fff2e5ff000-0x7fff2e600000) [vdso] (tgid:20388 range:0x7fff2e5ff000-0x7fff2e600000)
32311     0.8779  librt-2.14.90.so         clock_gettime

rb-writer:
852329   45.5852  libc-2.14.90.so          __memcpy_ssse3_back
413116   22.0947  libpthread-2.14.90.so    sem_post
219898   11.7608  libqb.so.0.13.0          qb_rb_space_free
161334    8.6286  libqb.so.0.13.0          qb_rb_chunk_commit
113573    6.0742  libqb.so.0.13.0          qb_rb_chunk_alloc
69525     3.7184  libqb.so.0.13.0          qb_rb_chunk_write
13529     0.7236  lt-rbwriter              main

rb-reader is spending 4x as much time in memcpy - possibly a cache
alignment issue.


Angus,

It might make more sense to store the used and free space in the header
to avoid their 11% cpu utilization.  The routines are a bit complex and
take a lot of clock cycles with the math + comparisons + jumping.  If
stored in the header, each process is responsible for maintaining its own
value (the writer responsible for setting used, the reader responsible
for setting free), which would convert those routines to a simple
addition and subtraction per message.
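
Roughly along these lines (the field names are invented for illustration,
not the libqb header layout; each side only ever bumps its own counter,
and used/free fall out of a subtraction):

/* Sketch only - invented field names, not the actual libqb header.
 * The writer bumps words_written, the reader bumps words_read; used
 * and free space reduce to a subtraction, with no shared routine. */
#include <stdint.h>

struct rb_shared_hdr {
	uint32_t word_size;              /* capacity of the data area */
	volatile uint32_t words_written; /* only the writer updates this */
	volatile uint32_t words_read;    /* only the reader updates this */
};

static inline uint32_t rb_space_used(const struct rb_shared_hdr *h)
{
	return h->words_written - h->words_read; /* unsigned wrap is fine */
}

static inline uint32_t rb_space_free(const struct rb_shared_hdr *h)
{
	return h->word_size - rb_space_used(h);
}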

I prototyped this change by changing rbreader not to call
qb_rb_space_used (those calls are only error checks atm), and throughput
dropped...

I tried a different change of adding a printf for each message received
in rbreader, and throughput increased 4x.

The conclusion from these results is that sem_wait enters the kernel via a
syscall (futex) more often because the reader is usually waiting on new
data from the writer.

When the writer floods the ring buffer faster than the reader can process
the messages, the user-space semaphore doesn't need to context switch
(sem_wait returns immediately).  This results in 550 MB/sec for the small
message sizes (vs 100 MB/sec when every message hits the context switch).

I have tried a number of alternatives but so far the fastest is without
any semaphores or atomics.  This means no count, and the ms_timeout passed
into the rb is ignored.  That should be ok as I have implemented it as an
option you can pass into qb_rb_open().
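
For example (assuming the option in the branch surfaces as a flag to
qb_rb_open() named something like QB_RB_FLAG_NO_SEMAPHORE - the final
spelling may differ):

/* Sketch: open a ring buffer in the semaphore-free mode from the speed
 * branch; the NO_SEMAPHORE flag name here is a guess at the final API. */
#include <stddef.h>
#include <qb/qbrb.h>

qb_ringbuffer_t *
open_semfree_rb(const char *name, size_t size)
{
	return qb_rb_open(name, size,
			  QB_RB_FLAG_CREATE | QB_RB_FLAG_NO_SEMAPHORE,
			  0 /* shared_user_data_size */);
}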

In the IPC I have tried eventfd (as Vladimir suggested) and that really
helps.
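
For anyone who hasn't used it: eventfd is just an 8-byte counter behind a
file descriptor; the writer bumps it with a write(), the reader blocks in
read() and gets the accumulated count back, so one wakeup can cover a
whole backlog of messages.  A bare sketch, independent of the libqb code:

/* Bare-bones eventfd notification, independent of the libqb IPC code. */
#include <sys/eventfd.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	if (efd < 0) {
		perror("eventfd");
		return 1;
	}

	/* writer side: bump the counter once per message */
	uint64_t one = 1;
	write(efd, &one, sizeof(one));
	write(efd, &one, sizeof(one));

	/* reader side: one read() collects everything posted so far,
	 * so a backed-up buffer costs a single wakeup */
	uint64_t pending;
	read(efd, &pending, sizeof(pending));
	printf("%llu message(s) pending\n", (unsigned long long)pending);

	close(efd);
	return 0;
}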

Vladimir, if you want to give this a go I have a branch on github:
https://github.com/asalkeld/libqb/commits/speed

That includes these changes.  Note it is not complete (a bit hacky :).
I still need to make all that code portable and test it.  But it improves
one-way comms by nearly 2x.

I also still need to hook up the event channel to use eventfd.

-Angus


This would also explain the more clients = less latency pattern seen,
since more clients will cause the server's ring buffers to back up.

Hope this helps.

Vladimir,

I am pretty sure there is no way to implement a process-shared ring
buffer without a semaphore (or some other dec&test mechanism) that
doesn't result in excessive cpu utilization (a naive approach would be
for the reader to spin on reading).  In the case of the other program you
spoke about, they batch up 1000 messages.  This allows the reader to
read all 1000 messages without having to context switch into the kernel
at inopportune times.  Essentially their ring buffer backs up a little.
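
For contrast, the naive spinning reader would look roughly like the loop
below (a sketch built on the existing qb_rb_chunk_read() call from
rbreader.c, assuming a zero ms_timeout makes it return immediately); it
never sleeps, so it pegs a cpu even when the buffer is empty:

/* Naive no-semaphore reader: poll with a zero timeout and retry.
 * Sketch only - this is the approach being argued against above. */
#include <stddef.h>
#include <sys/types.h>
#include <qb/qbrb.h>

static void spin_reader(qb_ringbuffer_t *rb, char *buffer, size_t buf_len)
{
	for (;;) {
		ssize_t num_read = qb_rb_chunk_read(rb, buffer, buf_len, 0);
		if (num_read < 0) {
			continue;	/* empty: spin straight back around */
		}
		/* ... consume num_read bytes of message data ... */
	}
}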

As an example, consider the following (in the libqb context):
1) writer writes a message into the ring buffer
2) writer sem_posts
3) writer writes a message into the ring buffer
4) reader sitting in futex (sem_wait) awakens
5) reader reads message
6) reader sem_waits, but because the message written in step 3 hasn't had
its sem_post yet, the reader enters a kernel wait queue
7) writer sem_posts
8) reader awakens and processes message

In a scenario where messages are batched, step 6 doesn't result in a
context switch; instead the message is processed immediately on a
multiprocessor.
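
To make that concrete, here is a toy model of a futex-backed semaphore
(not libqb's actual my_posix_sem_* code, just an illustration): when the
count is already positive the wait is a pure user-space decrement, and
only an empty count pays for the futex syscall and the context switch.

/* Toy futex-backed semaphore - illustration only, not libqb's code. */
#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void toy_sem_wait(atomic_int *count)
{
	for (;;) {
		int c = atomic_load(count);
		if (c > 0 &&
		    atomic_compare_exchange_weak(count, &c, c - 1)) {
			return;		/* fast path: no kernel entry */
		}
		if (c == 0) {
			/* slow path: sleep until a post wakes us */
			syscall(SYS_futex, (void *)count,
				FUTEX_WAIT, 0, NULL, NULL, 0);
		}
	}
}

static void toy_sem_post(atomic_int *count)
{
	atomic_fetch_add(count, 1);
	/* wake one sleeper, if any */
	syscall(SYS_futex, (void *)count, FUTEX_WAKE, 1, NULL, NULL, 0);
}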

It would be interesting to see what the latency is like without the
above example scenario.  I think I got this mostly to happen by simply
adding a printf (to slightly delay rbreader) with the attached patch.

Regards
-steve


Regards
-steve

-------- Original Message --------
Subject: Re:  cpg latency
Date: Fri, 18 May 2012 12:19:25 +0400
From: Voznesensky Vladimir <voznesensky@xxxxxx>
To: Steven Dake <sdake@xxxxxxxxxx>

We had tested it.

1 process sending 1 message per second gave:
70-120 microseconds latency without totem;
150-160 microseconds latency with totem.

4 processes each sending 1 message per second gave:
40-50 microseconds latency without totem;
50-60 microseconds latency with totem.

Thank you.
- VV



On 18.05.2012 11:03, Steven Dake wrote:
On 05/17/2012 08:19 AM, Voznesensky Vladimir wrote:
I've just tested libqb. For every message I've:
- Instrumented the client with TSC reads before qb_ipcc_send and after
qb_ipcc_recv;
- Instrumented the client with CPU frequency measurement;
- Commented out qb_log in s1_msg_process_fn of the server.

So it took 0.000140 sec - 0.000156 sec for every message to pass and
return.
As I understand it, that's a very large number.
Compare, for instance, with
http://code.google.com/p/disruptor/

Thanks.

I have sent a test patch which short-circuits totem for cpg messaging on
a single node.

Could you see how the latency looks (or post a sample latency tester)?
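
Such a latency tester could look roughly like the sketch below.  The
"test" service name, the message id, and the header-first layout are
assumptions, not an actual harness, and clock_gettime() stands in for the
TSC reads mentioned above:

/* Sketch of a cpg/libqb IPC round-trip timer - not an actual harness.
 * Assumes an IPC server named "test" that answers messages with
 * id QB_IPC_MSG_USER_START + 1. */
#include <qb/qbipcc.h>
#include <qb/qbipc_common.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

int main(void)
{
	struct {
		struct qb_ipc_request_header hdr;
		char payload[256];
	} msg;
	char resp[1024];
	struct timespec t0, t1;

	qb_ipcc_connection_t *c = qb_ipcc_connect("test", 8192);
	if (c == NULL) {
		perror("qb_ipcc_connect");
		return 1;
	}

	memset(&msg, 0, sizeof(msg));
	msg.hdr.id = QB_IPC_MSG_USER_START + 1;
	msg.hdr.size = sizeof(msg);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	qb_ipcc_send(c, &msg, msg.hdr.size);
	qb_ipcc_recv(c, resp, sizeof(resp), -1);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("round trip: %ld usec\n",
	       (t1.tv_sec - t0.tv_sec) * 1000000L +
	       (t1.tv_nsec - t0.tv_nsec) / 1000L);

	qb_ipcc_disconnect(c);
	return 0;
}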

Regards
-steve



diff --git a/tests/rbreader.c b/tests/rbreader.c
index 6b5356b..49aba12 100644
--- a/tests/rbreader.c
+++ b/tests/rbreader.c
@@ -62,6 +62,7 @@ int32_t main(int32_t argc, char *argv[])
	while (keep_reading) {
		num_read = qb_rb_chunk_read(rb, buffer,
					    BUFFER_CHUNK_SIZE, 5500);
+		printf ("%d\n", num_read);
		if (num_read < 0) {
			qb_perror(LOG_ERR, "nothing to read");
		}

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss

