Re: Priority based ping packet for 3.10

On Tue, Jan 24, 2017 at 10:39 AM, Vijay Bellur <vbellur@xxxxxxxxxx> wrote:


On Thu, Jan 19, 2017 at 8:06 AM, Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:
> The more relevant question would be: with TCP_KEEPALIVE and TCP_USER_TIMEOUT
> set on sockets, do we really need a ping-pong framework in clients? We might
> need it in transport/rdma setups, but my question concentrates on
> transport/socket. In other words, I would like to hear why we need a
> heart-beat mechanism in the first place. One scenario might be a healthy
> socket-level connection but an unhealthy brick/client (like a deadlocked one).

This is an important case to consider.  I think it answers your question
about TCP_KEEPALIVE: what we really care about is whether a brick's request
queue is moving.  In other words, what is the time since the last reply from
that brick, and does that time exceed some threshold?

I agree with this.
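
For context, this is roughly what arming the socket-level options mentioned
above looks like (the values and the helper name are made up for illustration;
this is not the actual rpc-transport code):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static int
arm_tcp_timeouts (int sock)
{
        int          on       = 1;
        int          idle     = 20;     /* seconds before first probe */
        int          interval = 2;      /* seconds between probes */
        int          count    = 9;      /* failed probes before declaring dead */
        unsigned int user_to  = 42000;  /* ms unacked data may sit (Linux) */

        if (setsockopt (sock, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof (on)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof (idle)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPINTVL, &interval,
                        sizeof (interval)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof (count)))
                return -1;
        if (setsockopt (sock, IPPROTO_TCP, TCP_USER_TIMEOUT, &user_to,
                        sizeof (user_to)))
                return -1;
        return 0;
}

The catch is exactly the deadlock case above: keepalive probes are answered by
the kernel, so a deadlocked brick process keeps looking healthy at this level.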
 
On a busy system, we don't even need ping packets to know that.  We can just
use the responses to other requests to set/reset that timer.  We only need to
send ping packets when our *outbound* queue has remained empty for some
fraction of our timeout.

Do we need ping packets sent even when the client is not waiting for any replies? I assume no. If there are no responses to be received and no requests being sent to a brick, why would a client be interested in the health of the server/brick?
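
In code, my understanding of the scheme comes down to something like this (all
names here are made up for illustration; this is not existing glusterfs code):

#include <time.h>

struct brick_conn {
        time_t last_activity;   /* refreshed by EVERY incoming reply */
        int    outstanding;     /* fops submitted but not yet answered */
        int    ping_timeout;    /* e.g. 42 seconds */
};

/* Hypothetical hooks into the transport. */
void brick_send_ping (struct brick_conn *conn);
void brick_disconnect (struct brick_conn *conn);

/* Any reply -- not just a pong -- counts as proof of life. */
void
on_reply (struct brick_conn *conn)
{
        conn->last_activity = time (NULL);
        conn->outstanding--;
}

/* Run periodically, e.g. every ping_timeout / 4 seconds. */
void
ping_timer_cb (struct brick_conn *conn)
{
        time_t idle = time (NULL) - conn->last_activity;

        if (conn->outstanding == 0)
                return;                    /* nothing pending: no ping needed */

        if (idle >= conn->ping_timeout)
                brick_disconnect (conn);   /* brick presumed unhealthy */
        else if (idle >= conn->ping_timeout / 4)
                brick_send_ping (conn);    /* quiet for a while: probe */
}

With outstanding == 0 the timer is effectively disarmed, which matches the
"no pending fops, no ping" position above.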
 

However, it's important that our measurements be *end to end* and not just
at the transport level.  This is particularly true with multiplexing,
where multiple bricks will share and contend on various resources.  We
should ping *through* client and server, with separate translators above
and below each.  This would give us a true end-to-end ping *for that
brick*, and also keep the code nicely modular.

Agree with this. My understanding is that the ping framework is a tool to identify unhealthy bricks (we are interested in bricks because they are the ones that serve fops). With that understanding, ping-pong should be end to end (to whatever logical entity constitutes a brick). However, where in the brick xlator stack should ping packets be answered? Should they go all the way down to storage/posix?
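
To make the placement question concrete, the decision is roughly the following
(purely illustrative pseudo-C, not the real xlator API): the higher in the
brick stack a ping is answered, the less of the fop path its success proves.

#include <stdbool.h>

enum fop_type { FOP_PING, FOP_WRITE /* ... */ };

struct frame;                           /* per-request call context */
void reply (struct frame *f, int ret);  /* unwind back to the client */
void pass_down (struct frame *f);       /* hand to the next xlator */

/* A "pong" translator that could sit at any depth in the brick stack. */
void
pong_dispatch (struct frame *f, enum fop_type fop, bool answer_here)
{
        if (fop == FOP_PING && answer_here) {
                reply (f, 0);    /* proves health only up to this layer */
                return;
        }
        pass_down (f);           /* deeper layers, down to posix, see it */
}

Answered at the top of the stack, a pong only proves the rpc layer is alive;
riding all the way to storage/posix, it also proves the io-threads queues and
the disk are moving.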
 

+1 to this. Having ping/pong xlators immediately above and below the protocol translators would also address the problem of epoll threads getting blocked in gluster's xlator stacks on busy systems.

Having said that, I do see value in Rafi's patch that prompted this thread. Would it not help to prioritize ping-pong traffic in all parts of the gluster stack, including the send queue on the client?

I have two concerns here:
1. The responsiveness of a brick to a client invariably includes the latency of the network and of our own transport's io-queue. Wouldn't prioritizing ping packets over normal data give us a skewed view of the brick's responsiveness? For example, on a network with heavy traffic, ping-pong might keep succeeding while fops move very slowly; what does a successful ping-pong achieve in that scenario? Conversely, does our response to the opposite scenario (a ping timeout followed by disconnecting the transport) achieve anything substantially good? Perhaps it helps bring down the latency of syscalls as experienced by the application, since our HA translators like afr and EC add the latency of identifying a disconnect (or a successful fop) to the latency of syscalls. As developers, many of us keep wondering what we are trying to achieve with a heartbeat mechanism in the first place.

2. Assuming we do want to prioritize ping traffic over normal traffic (which we already do logically, since ping packets don't traverse the entire brick xlator stack all the way down to posix, but short-circuit at protocol/server), the fix in discussion here is partial, as we can't prioritize ping traffic on the wire and through the tcp/ip stack. While I don't have strong objections to it, I feel it is a partial solution and might be inconsequential (just a hunch, no data). However, I can accept the patch if we feel it helps.
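
For clarity, the client-side half of that prioritization amounts to something
like the following (made-up types; the actual transport keeps its own ioq).
Pings jump ahead of queued fops, but once the bytes reach the kernel we have
no further control, which is the partial-solution caveat above:

#include <stddef.h>

struct msg {
        struct msg *next;
        int         is_ping;
        /* ... payload ... */
};

struct send_queue {
        struct msg *ping_head;     /* drained first */
        struct msg *normal_head;   /* drained second */
};

/* Pick the next message to write to the socket. */
struct msg *
next_to_send (struct send_queue *q)
{
        struct msg *m = q->ping_head ? q->ping_head : q->normal_head;

        if (!m)
                return NULL;               /* both lanes empty */
        if (m == q->ping_head)
                q->ping_head = m->next;    /* ping lane wins */
        else
                q->normal_head = m->next;
        return m;
}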


Regards,
Vijay




--
Raghavendra G
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://lists.gluster.org/mailman/listinfo/gluster-devel
