Re: Fwd: Re: cpg latency

On 05/18/2012 01:42 PM, Steven Dake wrote:
> 
> Taking this back to the list...
> 
> Odd that more cpg clients reduce latency...
> 
> Assuming the instrumentation is correct (everything I read says don't
> trust the TSC), the results are:
> 
> rbreader->rbwriter = 10-20 usec
> corosync 1 client without totem = 70-120 usec
> corosync 1 client with totem = 150-160 usec
> corosync 4 client without totem = 40-50 usec
> corosync 4 client with totem = 50-60 usec
> 
> This implies corosync adds 40usec of latency on top of the ringbuffer
> implementation.
> 
> Running rbwriter 1000 times in a loop with a rbreader running shows the
> following oprofile results:
> 
> rb-reader:
> 
> 2282834  62.0242  libc-2.14.90.so          __memcpy_ssse3_back
> 495055   13.4506  libqb.so.0.13.0          qb_rb_space_used
> 378187   10.2753  libpthread-2.14.90.so    sem_timedwait
> 207293    5.6321  libqb.so.0.13.0          qb_rb_chunk_check
> 82163     2.2324  libqb.so.0.13.0          qb_rb_chunk_read
> 78788     2.1407  libqb.so.0.13.0          qb_rb_chunk_reclaim
> 42200     1.1466  libqb.so.0.13.0          my_posix_sem_timedwait
> 35622     0.9678  [vdso] (tgid:20388 range:0x7fff2e5ff000-0x7fff2e600000) [vdso] (tgid:20388 range:0x7fff2e5ff000-0x7fff2e600000)
> 32311     0.8779  librt-2.14.90.so         clock_gettime
> 
> rb-writer:
> 852329   45.5852  libc-2.14.90.so          __memcpy_ssse3_back
> 413116   22.0947  libpthread-2.14.90.so    sem_post
> 219898   11.7608  libqb.so.0.13.0          qb_rb_space_free
> 161334    8.6286  libqb.so.0.13.0          qb_rb_chunk_commit
> 113573    6.0742  libqb.so.0.13.0          qb_rb_chunk_alloc
> 69525     3.7184  libqb.so.0.13.0          qb_rb_chunk_write
> 13529     0.7236  lt-rbwriter              main
> 
> rb-reader is spending ~4x as much time in memcpy - possibly a cache
> alignment issue.
> 

Angus,

It might make more sense to store the used and free space in the header
to avoid the ~11% cpu utilization those routines show.  The routines are
a bit complex and take a lot of clock cycles with the math + comparisons
+ jumping.  If stored in the header, each process would be responsible
for maintaining its own value (the writer maintains used, the reader
maintains free), which would reduce the accounting to a simple addition
and subtraction per message.
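
Something along these lines (just a sketch, not the actual libqb header
layout; I've written it as one monotonic counter owned by each side,
which amounts to the same thing):

#include <stdint.h>

struct rb_shared_hdr_sketch {
        uint32_t word_size;              /* ring capacity in words */
        volatile uint32_t words_written; /* only the writer updates this */
        volatile uint32_t words_read;    /* only the reader updates this */
};

static inline uint32_t rb_space_used(const struct rb_shared_hdr_sketch *h)
{
        /* unsigned wraparound keeps this correct after the counters overflow */
        return h->words_written - h->words_read;
}

static inline uint32_t rb_space_free(const struct rb_shared_hdr_sketch *h)
{
        return h->word_size - rb_space_used(h);
}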

I prototyped this change by changing rbreader not to call
qb_rb_space_used (those calls are only error checks at the moment), and
throughput dropped...

I tried a different change: adding a printf for each message received
in rbreader.  Throughput increased 4x.

The conclusion from these two results is that sem_wait enters the kernel
via the futex syscall more often when the reader keeps up, because it is
then usually waiting on new data from the writer.

When the writer floods the ring buffer faster than the reader can
process the messages, the user-space semaphore doesn't need to context
switch (sem_wait returns immediately).  This results in 550MB/sec for
the small message sizes (vs 100MB/sec when every message hits the
context switch).
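
To illustrate why the backed-up case is so much cheaper, here is a toy
futex-based counting semaphore (just an illustration, not the glibc
code; glibc additionally skips the FUTEX_WAKE syscall when no waiter is
parked in the kernel):

#include <stdint.h>
#include <unistd.h>
#include <linux/futex.h>
#include <sys/syscall.h>

struct toy_sem { volatile int32_t count; };

static void toy_sem_post(struct toy_sem *s)
{
        __sync_fetch_and_add(&s->count, 1);
        /* a real implementation elides this wake when nobody is sleeping */
        syscall(SYS_futex, &s->count, FUTEX_WAKE, 1, NULL, NULL, 0);
}

static void toy_sem_wait(struct toy_sem *s)
{
        for (;;) {
                int32_t c = s->count;
                if (c > 0) {
                        if (__sync_bool_compare_and_swap(&s->count, c, c - 1))
                                return; /* fast path: no syscall, no context switch */
                        continue;       /* lost a race with another waiter, retry */
                }
                /* slow path: sleep in the kernel until a post changes the count */
                syscall(SYS_futex, &s->count, FUTEX_WAIT, c, NULL, NULL, 0);
        }
}

As long as the writer stays ahead, the wait side always takes the fast
path, which is exactly the backed-up ring buffer case above.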

This would also explain the "more clients = less latency" pattern seen
above, since more clients cause the server's ring buffers to back up.

Hope this helps.

Vladimir,

I am pretty sure there is no way to implement a process-shared ring
buffer without a semaphore (or some other dec-and-test mechanism) that
doesn't result in excessive cpu utilization (a naive approach would be
for the reader to spin on reading).  In the case of the other program
you mentioned, they batch up 1000 messages.  This allows the reader to
read all 1000 messages without having to context switch into the kernel
at inopportune times.  Essentially, their ring buffer backs up slightly.

As an example, consider the following (in the libqb context):
1) writer writes a message into the ring buffer
2) writer sem_posts
3) writer writes a message into the ring buffer
4) reader sitting in futex (sem_wait) awakens
5) reader reads message
6) reader sem_waits, but because the sem_post for the message written in
step 3 hasn't happened yet, it enters a kernel wait queue
7) writer sem_posts
8) reader awakens and processes message

In a scenario where messages are batched, step 6 doesn't result in a
context switch; on a multiprocessor the reader instead processes the
message immediately.
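
With plain POSIX semaphores the batched reader looks roughly like this
(consume_one_chunk is a hypothetical stand-in for the qb_rb_chunk_read
call path, not a libqb function):

#include <errno.h>
#include <semaphore.h>

void reader_loop(sem_t *chunks_posted, void (*consume_one_chunk)(void))
{
        for (;;) {
                /* only block (and possibly context switch) when nothing is queued */
                while (sem_wait(chunks_posted) == -1 && errno == EINTR)
                        ;
                consume_one_chunk();

                /* drain everything the writer has already posted; these
                 * sem_trywait() calls succeed in user space with no futex call */
                while (sem_trywait(chunks_posted) == 0)
                        consume_one_chunk();
        }
}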

It would be interesting to see what the latency looks like without the
above scenario.  I think I mostly achieved this by simply adding a
printf (to slightly delay rbreader) with the attached patch.

Regards
-steve


> Regards
> -steve
> 
> -------- Original Message --------
> Subject: Re:  cpg latency
> Date: Fri, 18 May 2012 12:19:25 +0400
> From: Voznesensky Vladimir <voznesensky@xxxxxx>
> To: Steven Dake <sdake@xxxxxxxxxx>
> 
> We had tested it.
> 
> 1 process sending 1 message per second gave:
> 70-120 microseconds latency without totem;
> 150-160 microseconds latency with totem.
> 
> 4 processes each sending 1 message per second gave:
> 40-50 microseconds latency without totem;
> 50-60 microseconds latency with totem.
> 
> Thank you.
> - VV
> 
> 
> 
> On 18.05.2012 11:03, Steven Dake wrote:
>> On 05/17/2012 08:19 AM, Voznesensky Vladimir wrote:
>>> I've just tested libqb. For every message I've:
>>> - Instrumented a client with TSC reading before qb_ipcc_send and after
>>> qb_ipcc_recv;
>>> - Instrumented a client with CPU frequency testing.
>>> - Commented out qb_log in s1_msg_process_fn of a server.
>>>
>>> So, it took 0.000140 sec - 0.000156 sec for every message to pass and
>>> return.
>>> As I understand it, this is a very large number.
>>> Compare, for instance, with
>>> http://code.google.com/p/disruptor/
>>>
>>> Thanks.
>>>
>> I have sent a test patch which short-circuits totem for cpg messaging
>> on a single node.
>>
>> Could you see how the latency looks (or post a sample latency tester)?
>>
>> Regards
>> -steve
>>
> 

diff --git a/tests/rbreader.c b/tests/rbreader.c
index 6b5356b..49aba12 100644
--- a/tests/rbreader.c
+++ b/tests/rbreader.c
@@ -62,6 +62,7 @@ int32_t main(int32_t argc, char *argv[])
 	while (keep_reading) {
 		num_read = qb_rb_chunk_read(rb, buffer,
 					    BUFFER_CHUNK_SIZE, 5500);
+		printf ("%d\n", num_read);
 		if (num_read < 0) {
 			qb_perror(LOG_ERR, "nothing to read");
 		}
