Re: How long should metrics collection on a cluster take?

On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <sankarshan.mukhopadhyay@xxxxxxxxx> wrote:
On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri
<pkarampu@xxxxxxxxxx> wrote:
> hi,
>       Quite a few commands to monitor gluster at the moment take almost a
> second to give output.

Is this at the (most) minimum recommended cluster size?

Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.
 

> Some categories of these commands:
> 1) Any command that needs to do some sort of mount/glfs_init.
>      Examples: 1) heal info family of commands 2) statfs to find
> space-availability etc (On my laptop replica 3 volume with all local bricks,
> glfs_init takes 0.3 seconds on average)
> 2) glusterd commands that need to wait for the previous command to unlock.
> If the previous command is something related to lvm snapshot which takes
> quite a few seconds, it would be even more time consuming.
>
> Nowadays container workloads have hundreds of volumes, if not thousands. If
> we want to serve any monitoring solution at this scale (I have seen
> customers use up to 600 volumes at a time, and it will only get bigger), and
> let's say collecting metrics takes 2 seconds per volume (let us take the
> worst example, which has all major features enabled, like
> snapshot/geo-rep/quota etc.), that will mean it takes 20 minutes
> to collect metrics of a cluster with 600 volumes. What are the ways in
> which we can make this number more manageable? I was initially thinking that
> maybe it is possible to get gd2 to execute commands in parallel on different
> volumes, so potentially we could get this done in ~2 seconds. But quite a
> few of the metrics need a mount or the equivalent of a mount (glfs_init) to
> collect information like statfs, number of pending heals, quota
> usage etc. This may lead to high memory usage, as the size of the mounts
> tends to be high.
>
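
For reference, the arithmetic above (600 volumes at 2 seconds each, sequential versus parallel) can be sketched as below. This is only an illustrative estimate, not anything gd2 or glusterd provides: the function names, the `collect_one` callback, and the worker count of 20 are all hypothetical. Capping the number of workers is one way to bound how many mounts (or glfs_init instances) exist at the same time.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def estimated_seconds(volumes, seconds_per_volume, workers=1):
    """Wall-clock estimate when at most `workers` volumes are polled at once."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return math.ceil(volumes / workers) * seconds_per_volume


def collect_all(volumes, collect_one, max_workers):
    """Run collect_one(volume) for every volume with bounded concurrency.

    collect_one is a hypothetical per-volume collector supplied by the
    caller. max_workers caps how many collections run at the same time.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `volumes`.
        return dict(zip(volumes, pool.map(collect_one, volumes)))


print(estimated_seconds(600, 2))               # sequential: 1200 s, i.e. 20 minutes
print(estimated_seconds(600, 2, workers=600))  # one worker per volume: 2 s
print(estimated_seconds(600, 2, workers=20))   # 20 volumes at a time: 60 s
```

The estimate assumes every volume takes the same time and ignores glusterd locking, so real numbers would likely be worse.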

I am not sure if starting from the "worst example" (it certainly is
not) is a good place to start from.

I didn't understand your statement. Are you saying 600 volumes is a worst example?
 
That said, for any environment
with that number of disposable volumes, what kind of metrics do
actually make any sense/impact?

The same metrics you track for long-running volumes. It is just that the metrics
are interpreted differently. On a long-running volume, you would look at the metrics
to find out why the volume has not performed as expected in the last hour. Whereas
in this case, you would look at the metrics to find out why volumes that were
created and deleted in the last hour did not perform as expected.
 

> I wanted to seek suggestions from others on how to come to a conclusion
> about which path to take and what problems to solve.
>
> I will be happy to raise github issues based on our conclusions on this mail
> thread.
>
> --
> Pranith
>





--
sankarshan mukhopadhyay
<https://about.me/sankarshan.mukhopadhyay>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel



--
Pranith
