On Wed, Jul 25, 2018 at 11:53 PM, Yaniv Kaul <ykaul@xxxxxxxxxx> wrote:
>
>
> On Tue, Jul 24, 2018, 7:20 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
>>
>> hi,
>> Quite a few of the commands used to monitor gluster at the moment take
>> almost a second to produce output.
>> Some categories of these commands:
>> 1) Any command that needs to do some sort of mount/glfs_init.
>>    Examples: the heal info family of commands, and statfs to find space
>>    availability, etc. (On my laptop, with a replica 3 volume whose bricks
>>    are all local, glfs_init takes 0.3 seconds on average.)
>> 2) glusterd commands that need to wait for the previous command to unlock.
>>    If the previous command is something related to an LVM snapshot, which
>>    takes quite a few seconds, it is even more time consuming.
>>
>> Nowadays container workloads have hundreds of volumes, if not thousands.
>> If we want to serve any monitoring solution at this scale (I have seen
>> customers use up to 600 volumes at a time, and it will only get bigger),
>> and let's say collecting metrics takes 2 seconds per volume (taking the
>> worst case, with all major features enabled: snapshot/geo-rep/quota etc.),
>> that means it will take 20 minutes to collect metrics for a cluster with
>> 600 volumes. What are the ways in which we can make this number more
>> manageable? I was initially thinking it might be possible to get gd2 to
>> execute commands in parallel on different volumes, so potentially we could
>> get this done in ~2 seconds. But quite a few of the metrics need a mount,
>> or the equivalent of a mount (glfs_init), to collect information such as
>> statfs, the number of pending heals, quota usage, etc. This may lead to
>> high memory usage, as the size of the mounts tends to be high.
>>
>> I wanted to seek suggestions from others on how to come to a conclusion
>> about which path to take and what problems to solve.
>
>
> I would imagine that in the gd2 world:
> 1. All stats would be in etcd.
> 2. There will be a single API call for GetALLVolumesStats or something, and
> we won't be asking the client to loop, or there will be a similarly
> efficient single API to query and deliver stats for some volumes in a batch
> ('all bricks in host X', for example).

A single endpoint for metrics/monitoring was a topic that was not agreed
upon at <https://github.com/gluster/glusterd2/issues/538>

> Worth looking at how it's implemented elsewhere in K8s.
>
> In any case, when asking for metrics I assume the latest already-available
> values would be returned and we are not going to fetch them when queried.
> Fetching on demand is both fragile (imagine an entity that doesn't respond
> well) and adds latency, and the result would be inaccurate a split second
> later anyway.
>
> Y.

--
sankarshan mukhopadhyay
<https://about.me/sankarshan.mukhopadhyay>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
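
[Editor's sketch] A minimal illustration, in Go (the language gd2 is written
in), of the bounded-parallel collection Pranith suggests above. Everything
here is an assumption for illustration: collectVolumeStats, VolumeStats, the
2-second per-volume cost, and the worker count of 100 are hypothetical
stand-ins, not gd2 APIs. The point is that wall-clock time becomes roughly
ceil(volumes/workers) * per-volume cost, while the worker cap bounds how many
mounts (glfs_init instances) are alive at once, which is the memory concern
raised in the thread.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// VolumeStats is a placeholder for whatever a metrics endpoint would return.
type VolumeStats struct {
	Volume       string
	PendingHeals int
	FreeBytes    uint64
}

// collectVolumeStats stands in for the expensive per-volume work
// (glfs_init + statfs + heal info), assumed to take ~2s in the worst case.
func collectVolumeStats(volume string) VolumeStats {
	time.Sleep(2 * time.Second) // simulate the per-volume cost
	return VolumeStats{Volume: volume}
}

// collectAll fans the per-volume work out over `workers` goroutines, so the
// number of concurrent mounts never exceeds `workers`.
func collectAll(volumes []string, workers int) []VolumeStats {
	jobs := make(chan string)
	results := make(chan VolumeStats, len(volumes))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range jobs {
				results <- collectVolumeStats(v)
			}
		}()
	}

	for _, v := range volumes {
		jobs <- v
	}
	close(jobs)
	wg.Wait()
	close(results)

	var out []VolumeStats
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	volumes := make([]string, 600)
	for i := range volumes {
		volumes[i] = fmt.Sprintf("vol%03d", i)
	}
	start := time.Now()
	// 100 workers: ~12s wall clock for 600 volumes at 2s each, with at most
	// 100 mounts alive at any moment.
	stats := collectAll(volumes, 100)
	fmt.Printf("collected %d volumes in %s\n", len(stats), time.Since(start))
}
```

The worker count is the knob that trades collection time against memory from
concurrent mounts; the right value would have to be measured, not assumed.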