To add an additional data point... The operator will need to regularly reconcile the true state of the Gluster cluster with the desired state stored in Kubernetes. This task will run frequently (e.g., operator-framework defaults to every 5s even if there are no config changes).
The actual amount of data we will need to query from the cluster is currently TBD and likely significantly affected by Heketi/GD1 vs. GD2 choice.
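
To make the reconcile step concrete, here is a minimal sketch of one pass, assuming the desired and actual states can each be represented as a mapping from volume name to its settings. The function name, the action labels, and the data shapes are all hypothetical and are not taken from operator-framework, Heketi, or GD2.

```python
# Hypothetical sketch: names and data shapes are assumptions for illustration,
# not the API of operator-framework, Heketi, GD1, or GD2.
from typing import Dict, List, Tuple


def plan_reconcile(desired: Dict[str, dict],
                   actual: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Return (action, volume_name) pairs that move `actual` toward `desired`."""
    actions: List[Tuple[str, str]] = []
    # Volumes requested in Kubernetes but missing from the cluster.
    for name in sorted(desired.keys() - actual.keys()):
        actions.append(("create", name))
    # Volumes present in the cluster but no longer requested.
    for name in sorted(actual.keys() - desired.keys()):
        actions.append(("delete", name))
    # Volumes present on both sides whose settings differ.
    for name in sorted(desired.keys() & actual.keys()):
        if desired[name] != actual[name]:
            actions.append(("update", name))
    return actions
```

The `actual` mapping has to be rebuilt from the cluster on every pass, so the cost of querying cluster state is paid once per reconcile interval even when nothing has changed.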
-John
On Wed, Jul 25, 2018 at 5:45 AM Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
On Tue, Jul 24, 2018 at 10:10 PM, Sankarshan Mukhopadhyay <sankarshan.mukhopadhyay@xxxxxxxxx> wrote:
On Tue, Jul 24, 2018 at 9:48 PM, Pranith Kumar Karampuri <pkarampu@xxxxxxxxxx> wrote:
> hi,
> Quite a few commands to monitor gluster at the moment take almost a
> second to give output.
Is this at the (most) minimum recommended cluster size?

Yes, with a single volume with 3 bricks, i.e. 3 nodes in the cluster.

> Some categories of these commands:
> 1) Any command that needs to do some sort of mount/glfs_init.
> Examples: 1) heal info family of commands 2) statfs to find
> space-availability etc (On my laptop replica 3 volume with all local bricks,
> glfs_init takes 0.3 seconds on average)
> 2) glusterd commands that need to wait for the previous command to unlock.
> If the previous command is something related to lvm snapshot which takes
> quite a few seconds, it would be even more time consuming.
>
> Nowadays container workloads have hundreds of volumes, if not thousands. If
> we want to serve any monitoring solution at this scale (I have seen
> customers use up to 600 volumes at a time, and it will only get bigger), and
> let's say collecting metrics takes 2 seconds per volume (let us take the
> worst example, which has all major features enabled, like
> snapshot/geo-rep/quota etc.), that will mean it takes 20 minutes
> to collect the metrics of a cluster with 600 volumes. What are the ways in
> which we can make this number more manageable? I was initially thinking it
> may be possible to get gd2 to execute commands in parallel on different
> volumes, so potentially we could get this done in ~2 seconds. But quite a
> few of the metrics need a mount or the equivalent of a mount (glfs_init) to
> collect information like statfs, number of pending heals, quota
> usage etc. This may lead to high memory usage, as the size of the mounts
> tends to be high.
>
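
As a rough illustration of the numbers in the paragraph above, the sketch below computes the serial estimate (600 volumes at 2 seconds each) and shows one way per-volume collection could be fanned out with a bounded worker pool. The `fake_collect` function is a hypothetical stand-in for a real per-volume query, and the `max_workers` default is an arbitrary assumption; nothing here is a GD2 or glusterd API.

```python
# Hypothetical sketch: `fake_collect` stands in for a real per-volume query
# (heal info, statfs, quota usage); it is not a Gluster API.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def serial_estimate_seconds(num_volumes: int, seconds_per_volume: float) -> float:
    """Total time when volumes are queried one after another."""
    return num_volumes * seconds_per_volume


def collect_parallel(volumes: Iterable[str],
                     collect_one: Callable[[str], dict],
                     max_workers: int = 32) -> Dict[str, dict]:
    """Query volumes concurrently; `max_workers` caps simultaneous queries."""
    names = list(volumes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with names.
        results = list(pool.map(collect_one, names))
    return dict(zip(names, results))


def fake_collect(volume: str) -> dict:
    """Placeholder collector returning a fixed metric for the given volume."""
    return {"volume": volume, "pending_heals": 0}


if __name__ == "__main__":
    print(serial_estimate_seconds(600, 2.0) / 60, "minutes if run serially")
    print(len(collect_parallel([f"vol{i}" for i in range(600)], fake_collect)))
```

The worker cap matters for the reason given above: if each query needs its own mount or glfs_init, the number of concurrent workers bounds how many of those are held in memory at once, so the ~2 second figure is only reachable if that memory cost is acceptable and the commands do not serialize on a glusterd lock.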
I am not sure if starting from the "worst example" (it certainly is
not) is a good place to start from.

I didn't understand your statement. Are you saying 600 volumes is a worst example?

That said, for any environment
with that number of disposable volumes, what kind of metrics do
actually make any sense/impact?

Same metrics you track for long running volumes. It is just that the way the metrics are interpreted will be different. On a long running volume, you would look at the metrics and try to find why the volume has not been giving the expected performance in the last 1 hour. Whereas in this case, you would look at the metrics and find the reason why volumes that were created and deleted in the last hour didn't give the expected performance.

--
> I wanted to seek suggestions from others on how to come to a conclusion
> about which path to take and what problems to solve.
>
> I will be happy to raise github issues based on our conclusions on this mail
> thread.
>
> --
> Pranith
>
sankarshan mukhopadhyay
<https://about.me/sankarshan.mukhopadhyay>
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-devel
--
Pranith