Hi,
I've done some tracing of the latency that the network layer introduces in Gluster. I did the analysis as part of the pgbench performance issue (in particular the initialization and scaling phase), so I decided to look at READV for this particular workload (the total number of read requests is ~850000), but I think the results can be extrapolated to other operations that also have small latency (data already cached by the FS, for example).
Note that measuring latencies introduces some latency of its own: each probe point makes a call to clock_get_time(), so the real latencies will be a bit lower, but still proportional to these numbers.
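Roughly, a probe point boils down to something like this (a minimal sketch, assuming clock_gettime(CLOCK_MONOTONIC) as the time source; probe_now_us() and record_latency() are just illustrative names, not the actual code):

    #include <stdint.h>
    #include <time.h>

    /* Return a monotonic timestamp in microseconds. */
    static inline uint64_t probe_now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
    }

    /* Usage at two probe points (names are illustrative only):
     *
     *     uint64_t t0 = probe_now_us();   // e.g. at client_readv
     *     ...
     *     uint64_t t1 = probe_now_us();   // e.g. at client_submit_request
     *     record_latency("client_readv -> client_submit_request", t1 - t0);
     */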
Raw results (times in microseconds):
Client:
      Min       Avg         Max  From                            To
    1.205     3.554      50.371  client_readv                  -> client_submit_request
    0.828     2.472      44.811  client_submit_request         -> rpc_clnt_submit
    0.792     5.347     267.897  rpc_clnt_submit               -> socket_submit_outgoing_msg
    4.337    11.229     242.354  socket_submit_outgoing_msg    -> msg sent
   25.447    99.988   49648.239  msg sent                      -> __socket_proto_state_machine   **** Time taken by kernel + network + brick
    9.258    21.838     544.534  __socket_proto_state_machine  -> rpc_clnt_handle_reply
    0.437     1.176      92.899  rpc_clnt_handle_reply         -> client4_0_readv_cbk
    0.359     0.750      31.189  client4_0_readv_cbk           -> UNWIND
Server:
      Min       Avg         Max  From                            To
    5.145    11.891     326.291  __socket_proto_state_machine  -> rpcsvc_handle_rpc_call
    1.943     4.036     160.085  rpcsvc_handle_rpc_call        -> req queued
    0.361     8.989   33658.985  req queued                    -> rpcsvc_request_handler
    0.125     0.591      47.348  rpcsvc_request_handler        -> server4_0_readv
    2.064     5.373     643.653  server4_0_readv               -> server4_readv_resume
   14.610    33.766     641.871  server4_readv_resume          -> server4_readv_cbk
    0.144     0.349      23.488  server4_readv_cbk             -> server_submit_reply
    1.018     2.302      39.741  server_submit_reply           -> rpcsvc_submit_generic
    0.631     1.608      42.477  rpcsvc_submit_generic         -> socket_submit_outgoing_msg
    6.013    22.009   48756.083  socket_submit_outgoing_msg    -> msg sent
If we look only at the averages, we can see that the latency of the READV on posix is very low, ~33 us (the brick is on an NVMe drive, so this is expected).
The total latency on the brick side is 90.914 us (the sum of all the avg latencies). This seems consistent with the ~100 us the client sees between the time the message is sent and the time the answer arrives. The remaining ~9 us gap can probably be attributed to kernel processing (I'm using the loopback device, so there are no real network latencies here).
This means that each request suffers a ~70 us delay on top of the actual work (the ~100 us round trip seen by the client minus the ~33 us spent on the brick doing the read). This is not important when the operation takes on the order of milliseconds, like reads and writes on spinning disks, but for very fast operations (bricks on NVMe, and most metadata operations when the information is already cached on the brick) it means being roughly 3 times slower. So when we have huge amounts of operations with very small latency, the accumulated overhead is huge.
If we also add the latencies on the client side, we are talking about 146 us per read request, while the brick is able to serve it in ~33 us. That's 4.4 times slower.
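For clarity, this is how those figures come out of the Avg columns above:

    Brick side total:   11.891 + 4.036 + 8.989 + 0.591 + 5.373 + 33.766
                        + 0.349 + 2.302 + 1.608 + 22.009              =  90.914 us
    Kernel gap:         99.988 (msg sent -> reply on client) - 90.914 =  ~9 us
    Client side total:  3.554 + 2.472 + 5.347 + 11.229 + 99.988
                        + 21.838 + 1.176 + 0.750                      = 146.354 us
    Slowdown:           146.354 / ~33                                 =  ~4.4x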
It's interesting to see the amount of latency that sending a message introduces: on the client side it takes ~11 us to send the request (a very small message containing only a few values, like offset and size). On the brick it also takes ~11 us to read it, but ~22 us to send the answer, probably because the answer also carries the 8KB block read from disk; on the client side it likewise takes ~22 us to read that answer. Part of this could be attributed to the system calls needed to push the data through the socket, so we could try to send/recv all the data with a single call (see the sketch below). However, the minimum latencies are noticeably smaller than the averages, which makes me think that not all of this latency is caused by system calls. We can also see that times on the brick are very unstable: the time needed to process requests queued to another thread is sometimes quite high, and so is the maximum time needed to send the answer.
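As a rough sketch of the single-call idea (this is not the current socket transport code; the function name and buffer layout are made up), writev() can push the RPC header and the data block through the socket in one syscall:

    #include <sys/uio.h>

    /* Hypothetical reply layout: an RPC header plus the 8KB data block.
     * Instead of two write() calls (header, then payload), a single
     * writev() pushes both through the socket in one syscall. */
    static ssize_t send_reply_coalesced(int sock,
                                        const void *hdr, size_t hdr_len,
                                        const void *data, size_t data_len)
    {
        struct iovec iov[2] = {
            { .iov_base = (void *)hdr,  .iov_len = hdr_len  },
            { .iov_base = (void *)data, .iov_len = data_len },
        };

        /* A real implementation would loop on partial writes and EINTR. */
        return writev(sock, iov, 2);
    }

The same could be done on the receiving side with readv(), once the expected sizes are known from the RPC header.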
For the particular pgbench case, the number of requests is huge. With a scaling factor of just 100, it sends hundreds of thousands of requests. As an example, this run sent:
- 520154 FSTAT (24.25 us avg latency on brick)
- 847900 READV (31.30 us avg latency on brick)
- 708850 WRITEV (64.39 us avg latency on brick)
Considering a total overhead of ~113 us per request (roughly the 146 us total minus the ~33 us of actual work on the brick), we have about 235 seconds of overhead spent only on the network layer. Considering that the whole test took 476 seconds, this represents ~50% of the total time.
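In numbers:

    Total requests:  520154 + 847900 + 708850  = 2076904
    Overhead:        2076904 * 113 us          = ~235 s
    Share:           235 s / 476 s             = ~49% of the run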
Conclusions:
- We should never have a volume with caching-related xlators disabled. The price we pay for it is too high. We need to make them work consistently and aggressively to avoid as many requests as we can.
- We need to analyze the client/server xlators more deeply to see if we can avoid some of these delays. However, optimizing something that is already at the microsecond level can be very hard.
- We need to determine what causes the fluctuations on the brick side and avoid them.
- This scenario is very similar to a smallfile/metadata workload, so this is probably one important cause of its bad performance.