Hello all,
I am trying to make a note of all the gluster operations that happen over the network, in order to be able to tune our gluster nodes as required.
First of all, our setup is far from ideal, but we want to tune it as well as possible.
-> Our nodes are also LSF execution nodes, so cpu, memory and network are shared between LSF jobs and gluster. (We are planning to use cgroups to reserve enough cpu, memory and network for gluster.)
-> However, in our LSF setup we allow jobs to use more memory than they requested, so we can expect aggressive swapping when memory demand is too high.
-> On top of that, our swap disk, gluster bricks and the entire os filesystem all come from the same raid disk. So whenever there's swapping, the utilization of our only disk goes over the top and in turn affects gluster IO.
-> Slower performance is okay, as we will take the necessary steps in time, i.e. kill jobs that are using more memory than they requested.
-> But we don't want network timeout or connection reset errors, which could mess up the entire cluster's operations and would need a bit of heavy work to resolve.
-> I'm not sure if the above scenario can cause these timeout errors. However, there are other cases which can cause them, and which we have also observed.
-> We increased transport.listen-backlog in gluster to a higher value (200) and tuned the kernel with net.core.somaxconn=1024 and net.ipv4.tcp_max_syn_backlog=20480.
-> These are just arbitrary high values, and I'm not sure if they are enough.
-> So we can fairly expect timeout errors, as our tuning is not perfect. Hence, to be able to analyze these issues, I want to estimate the possible number of pending connections and the amount of network communication, and for that I need to know all the gluster operations and their frequency.
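
One way to check whether the backlog values are too small is to watch the kernel's listen-queue counters over time. The following is only a minimal sketch, assuming a Linux node where /proc/net/netstat exposes the ListenOverflows and ListenDrops counters in its TcpExt section; the function names are made up for illustration. The counters are system-wide, so they do not say which listening socket (gluster or otherwise) overflowed.

```python
from pathlib import Path


def parse_tcpext(text):
    """Return the TcpExt counters from /proc/net/netstat content as a dict."""
    lines = text.splitlines()
    # The file comes in pairs: a line of counter names, then a line of values.
    for names, values in zip(lines[::2], lines[1::2]):
        if names.startswith("TcpExt:"):
            keys = names.split()[1:]
            vals = values.split()[1:]
            return {k: int(v) for k, v in zip(keys, vals)}
    return {}


def listen_queue_report(netstat_text, somaxconn_text):
    """Combine the somaxconn limit with the listen-queue overflow counters."""
    counters = parse_tcpext(netstat_text)
    return {
        "somaxconn": int(somaxconn_text.strip()),
        "ListenOverflows": counters.get("ListenOverflows", 0),
        "ListenDrops": counters.get("ListenDrops", 0),
    }


if __name__ == "__main__":
    netstat = Path("/proc/net/netstat")
    somaxconn = Path("/proc/sys/net/core/somaxconn")
    # Only read the live values when running on a Linux host.
    if netstat.exists() and somaxconn.exists():
        report = listen_queue_report(netstat.read_text(), somaxconn.read_text())
        for name, value in report.items():
            print(f"{name}: {value}")
```

If ListenOverflows or ListenDrops keeps growing between two runs, some accept queue on the node is filling up, which would suggest the backlog settings (or the speed at which connections are accepted) need another look.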
Examples of the operations I mean:
Self-heal daemon operations
Peer communication for gluster peer status (does this happen?)
and so on.
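
To get a rough picture of how many connections each gluster port is handling, something like the sketch below could be run periodically on a node. It assumes Linux and the IPv4 table in /proc/net/tcp (IPv6 sockets are listed separately in /proc/net/tcp6). The function name is hypothetical, and the port numbers in the sample are only the usual defaults as far as I know (24007 for glusterd, 49152 and up for bricks); the actual brick ports should be taken from gluster volume status.

```python
from collections import Counter

# Subset of the kernel's TCP state codes as printed in /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "03": "SYN_RECV",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}


def count_by_port_and_state(proc_net_tcp_text):
    """Count sockets per (local port, state) from /proc/net/tcp content."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        # local_address looks like "0100007F:5DC7" (hex address, hex port).
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        # Unknown state codes are kept as their raw hex value.
        state = TCP_STATES.get(fields[3], fields[3])
        counts[(local_port, state)] += 1
    return counts


# Hand-written sample in the /proc/net/tcp layout (trailing columns omitted).
SAMPLE = """\
  sl  local_address rem_address   st
   0: 00000000:5DC7 00000000:0000 0A
   1: 0A00000A:5DC7 0B00000A:03FF 01
   2: 0A00000A:C000 0B00000A:03FE 01
   3: 0A00000A:C000 0C00000A:03FD 03
"""

if __name__ == "__main__":
    for (port, state), n in sorted(count_by_port_and_state(SAMPLE).items()):
        print(port, state, n)
```

On a real node the sample text would be replaced by the content of /proc/net/tcp. A growing number of SYN_RECV entries on a gluster port would point at connections waiting in the backlog.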
Thanks in advance.
Regards,
Jeevan.