Hi,
I've done some tracing of the latency that the network layer introduces in Gluster. I did the analysis as part of the pgbench performance issue (in particular the initialization and scaling phase), so I decided to look at READV for this particular workload (the total number of read requests is ~850000), but I think the results can be extrapolated to other operations that also have small latency (data already cached by the FS, for example).
Note that measuring latencies introduces some latency of its own: each probe point makes a call to clock_get_time(), so the real latencies will be a bit lower, but still proportional to these numbers.
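Roughly, a probe point boils down to something like this (a minimal sketch, assuming clock_gettime(CLOCK_MONOTONIC) as the time source; probe_now_us() and record_latency() are just illustrative names, not the actual code):

    #include <stdint.h>
    #include <time.h>

    /* Return a monotonic timestamp in microseconds. */
    static inline uint64_t probe_now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
    }

    /* Usage at two probe points (names are illustrative only):
     *
     *     uint64_t t0 = probe_now_us();   // e.g. at client_readv
     *     ...
     *     uint64_t t1 = probe_now_us();   // e.g. at client_submit_request
     *     record_latency("client_readv -> client_submit_request", t1 - t0);
     */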
Raw results (times in microseconds):
Client:
      Min       Avg         Max  From                            To
    1.205     3.554      50.371  client_readv                  -> client_submit_request
    0.828     2.472      44.811  client_submit_request         -> rpc_clnt_submit
    0.792     5.347     267.897  rpc_clnt_submit               -> socket_submit_outgoing_msg
    4.337    11.229     242.354  socket_submit_outgoing_msg    -> msg sent
   25.447    99.988   49648.239  msg sent                      -> __socket_proto_state_machine   **** Time taken by kernel + network + brick
    9.258    21.838     544.534  __socket_proto_state_machine  -> rpc_clnt_handle_reply
    0.437     1.176      92.899  rpc_clnt_handle_reply         -> client4_0_readv_cbk
    0.359     0.750      31.189  client4_0_readv_cbk           -> UNWIND
Server:
      Min       Avg         Max  From                            To
    5.145    11.891     326.291  __socket_proto_state_machine  -> rpcsvc_handle_rpc_call
    1.943     4.036     160.085  rpcsvc_handle_rpc_call        -> req queued
    0.361     8.989   33658.985  req queued                    -> rpcsvc_request_handler
    0.125     0.591      47.348  rpcsvc_request_handler        -> server4_0_readv
    2.064     5.373     643.653  server4_0_readv               -> server4_readv_resume
   14.610    33.766     641.871  server4_readv_resume          -> server4_readv_cbk
    0.144     0.349      23.488  server4_readv_cbk             -> server_submit_reply
    1.018     2.302      39.741  server_submit_reply           -> rpcsvc_submit_generic
    0.631     1.608      42.477  rpcsvc_submit_generic         -> socket_submit_outgoing_msg
    6.013    22.009   48756.083  socket_submit_outgoing_msg    -> msg sent
If we look only at the averages, we can see that the latency of the READV on posix is very low, ~33 us (the brick is on an NVMe drive, so this is expected).
The total latency on the brick side is 90.914 us (the sum of all the avg latencies). This seems consistent with the ~100 us the client sees between the time the message is sent and the time the answer arrives. The remaining ~9 us gap can probably be attributed to kernel processing (I'm using the loopback device, so there are no real network latencies here).
This means that each request suffers a ~70 us delay on top of the actual work (the ~100 us round trip seen by the client minus the ~33 us spent on the brick doing the read). This is not important when the operation takes on the order of milliseconds, like reads and writes on spinning disks, but for very fast operations (bricks on NVMe, and most metadata operations when the information is already cached on the brick) it means being roughly 3 times slower. So when we have huge amounts of operations with very small latency, the accumulated overhead is huge.
If we also add the latencies on the client side, we are talking about 146 us per read request, while the brick is able to serve it in ~33 us. That's 4.4 times slower.
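For clarity, this is how those figures come out of the Avg columns above:

    Brick side total:   11.891 + 4.036 + 8.989 + 0.591 + 5.373 + 33.766
                        + 0.349 + 2.302 + 1.608 + 22.009              =  90.914 us
    Kernel gap:         99.988 (msg sent -> reply on client) - 90.914 =  ~9 us
    Client side total:  3.554 + 2.472 + 5.347 + 11.229 + 99.988
                        + 21.838 + 1.176 + 0.750                      = 146.354 us
    Slowdown:           146.354 / ~33                                 =  ~4.4x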
It's interesting to see the amount of latency that sending a message introduces: on the client side it takes ~11 us to send the request (a very small message containing only a few values, like offset and size). On the brick it also takes ~11 us to read it, but ~22 us to send the answer, probably because the answer also carries the 8KB block read from disk; on the client side it likewise takes ~22 us to read that answer. Part of this could be attributed to the system calls needed to push the data through the socket, so we could try to send/recv all the data with a single call (see the sketch below). However, the minimum latencies are noticeably smaller than the averages, which makes me think that not all of this latency is caused by system calls. We can also see that times on the brick are very unstable: the time needed to process requests queued to another thread is sometimes quite high, and so is the maximum time needed to send the answer.
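As a rough sketch of the single-call idea (this is not the current socket transport code; the function name and buffer layout are made up), writev() can push the RPC header and the data block through the socket in one syscall:

    #include <sys/uio.h>

    /* Hypothetical reply layout: an RPC header plus the 8KB data block.
     * Instead of two write() calls (header, then payload), a single
     * writev() pushes both through the socket in one syscall. */
    static ssize_t send_reply_coalesced(int sock,
                                        const void *hdr, size_t hdr_len,
                                        const void *data, size_t data_len)
    {
        struct iovec iov[2] = {
            { .iov_base = (void *)hdr,  .iov_len = hdr_len  },
            { .iov_base = (void *)data, .iov_len = data_len },
        };

        /* A real implementation would loop on partial writes and EINTR. */
        return writev(sock, iov, 2);
    }

The same could be done on the receiving side with readv(), once the expected sizes are known from the RPC header.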
For the particular pgbench case, the number of requests is huge. With a scaling factor of just 100, it sends hundreds of thousands of requests. As an example, this run sent:
- 520154 FSTAT (24.25 us avg latency on brick)
- 847900 READV (31.30 us avg latency on brick)
- 708850 WRITEV (64.39 us avg latency on brick)
Considering a total overhead of ~113 us per request (roughly the 146 us total minus the ~33 us of actual work on the brick), we have about 235 seconds of overhead spent only on the network layer. Considering that the whole test took 476 seconds, this represents ~50% of the total time.
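In numbers:

    Total requests:  520154 + 847900 + 708850  = 2076904
    Overhead:        2076904 * 113 us          = ~235 s
    Share:           235 s / 476 s             = ~49% of the run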
Conclusions:
- We should never have a volume with caching-related xlators disabled. The price we pay for it is too high. We need to make them work consistently and aggressively to avoid as many requests as we can.
- We need to analyze the client/server xlators more deeply to see if we can avoid some of these delays. However, optimizing something that is already at the microsecond level can be very hard.
- We need to determine what causes the fluctuations on the brick side and avoid them.
- This scenario is very similar to a smallfile/metadata workload, so this is probably one important cause of its bad performance.