On Wed, Jul 25, 2018 at 11:53 PM, Yaniv Kaul <ykaul@xxxxxxxxxx> wrote:
>
>
> On Tue, Jul 24, 2018, 7:20 PM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
>>
>> hi,
>> Quite a few of the commands used to monitor gluster at the moment take
>> almost a second to produce output.
>> Some categories of these commands:
>> 1) Any command that needs to do some sort of mount/glfs_init.
>>    Examples: the heal info family of commands, and statfs to find space
>>    availability, etc. (On my laptop, with a replica 3 volume whose bricks
>>    are all local, glfs_init takes 0.3 seconds on average.)
>> 2) glusterd commands that need to wait for the previous command to unlock.
>>    If the previous command is something related to an LVM snapshot, which
>>    takes quite a few seconds, it is even more time consuming.
>>
>> Nowadays container workloads have hundreds of volumes, if not thousands.
>> If we want to serve any monitoring solution at this scale (I have seen
>> customers use up to 600 volumes at a time, and it will only get bigger),
>> and let's say collecting metrics takes 2 seconds per volume (taking the
>> worst case, with all major features enabled: snapshot/geo-rep/quota etc.),
>> that means it will take 20 minutes to collect metrics for a cluster with
>> 600 volumes. What are the ways in which we can make this number more
>> manageable? I was initially thinking it might be possible to get gd2 to
>> execute commands in parallel on different volumes, so potentially we could
>> get this done in ~2 seconds. But quite a few of the metrics need a mount,
>> or the equivalent of a mount (glfs_init), to collect information such as
>> statfs, the number of pending heals, quota usage, etc. This may lead to
>> high memory usage, as the size of the mounts tends to be high.
>>
>> I wanted to seek suggestions from others on how to come to a conclusion
>> about which path to take and what problems to solve.
>
>
> I would imagine that in the gd2 world:
> 1. All stats would be in etcd.
> 2. There will be a single API call for GetALLVolumesStats or something, and
> we won't be asking the client to loop, or there will be a similarly
> efficient single API to query and deliver stats for some volumes in a batch
> ('all bricks in host X', for example).

A single endpoint for metrics/monitoring was a topic that was not agreed
upon at <https://github.com/gluster/glusterd2/issues/538>

> Worth looking at how it's implemented elsewhere in K8s.
>
> In any case, when asking for metrics I assume the latest already-available
> values would be returned and we are not going to fetch them when queried.
> Fetching on demand is both fragile (imagine an entity that doesn't respond
> well) and adds latency, and the result would be inaccurate a split second
> later anyway.
>
> Y.

--
sankarshan mukhopadhyay
<https://about.me/sankarshan.mukhopadhyay>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
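
[Editor's sketch] A minimal illustration, in Go (the language gd2 is written
in), of the bounded-parallel collection Pranith suggests above. Everything
here is an assumption for illustration: collectVolumeStats, VolumeStats, the
2-second per-volume cost, and the worker count of 100 are hypothetical
stand-ins, not gd2 APIs. The point is that wall-clock time becomes roughly
ceil(volumes/workers) * per-volume cost, while the worker cap bounds how many
mounts (glfs_init instances) are alive at once, which is the memory concern
raised in the thread.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// VolumeStats is a placeholder for whatever a metrics endpoint would return.
type VolumeStats struct {
	Volume       string
	PendingHeals int
	FreeBytes    uint64
}

// collectVolumeStats stands in for the expensive per-volume work
// (glfs_init + statfs + heal info), assumed to take ~2s in the worst case.
func collectVolumeStats(volume string) VolumeStats {
	time.Sleep(2 * time.Second) // simulate the per-volume cost
	return VolumeStats{Volume: volume}
}

// collectAll fans the per-volume work out over `workers` goroutines, so the
// number of concurrent mounts never exceeds `workers`.
func collectAll(volumes []string, workers int) []VolumeStats {
	jobs := make(chan string)
	results := make(chan VolumeStats, len(volumes))

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for v := range jobs {
				results <- collectVolumeStats(v)
			}
		}()
	}

	for _, v := range volumes {
		jobs <- v
	}
	close(jobs)
	wg.Wait()
	close(results)

	var out []VolumeStats
	for r := range results {
		out = append(out, r)
	}
	return out
}

func main() {
	volumes := make([]string, 600)
	for i := range volumes {
		volumes[i] = fmt.Sprintf("vol%03d", i)
	}
	start := time.Now()
	// 100 workers: ~12s wall clock for 600 volumes at 2s each, with at most
	// 100 mounts alive at any moment.
	stats := collectAll(volumes, 100)
	fmt.Printf("collected %d volumes in %s\n", len(stats), time.Since(start))
}
```

The worker count is the knob that trades collection time against memory from
concurrent mounts; the right value would have to be measured, not assumed.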