> On 26 June 2017 at 6:49, saumay agrawal <saumay.agrawal@xxxxxxxxx> wrote:
>
> Hi everyone!
>
> I am working on the improvement of the web-based dashboard for Ceph.
> My intention is to add some UI elements to visualise some performance
> counters of a Ceph cluster. This gives dashboard users a better
> overview of how the Ceph cluster is performing and, if necessary,
> where they can optimise to get even better performance from the
> cluster.
>
> Here is my suggestion for two perf counters, commit latency and apply
> latency. They are visualised using line graphs. I have prepared UI
> mockups for both:
> 1. OSD apply latency
> [https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
> 2. OSD commit latency
> [https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]
>
> These mockups plot the latency values (y-axis) against time (x-axis).
> The latency values for different OSDs are highlighted using different
> colours. The average latency of all OSDs is shown specifically in
> red. This representation allows the dashboard user to compare the
> performance of an OSD with that of other OSDs, as well as with the
> average performance of the cluster.

Is avg really the best way to go? Most big clusters (100s or 1000s of
OSDs) always have a few OSDs idle, so they will report a latency of
0ms. A 0 brings down an average pretty fast. Isn't something like the
median a better way to go?

> The line width in these graphs is deliberately kept small, so as to
> give a crisp and clear representation for a larger number of OSDs.
> However, this approach may clutter the graph and make it
> incomprehensible for a cluster with a significantly higher number of
> OSDs. For such situations, we can retain only the average latency
> indications in both graphs to keep things simpler for the dashboard
> user.

I see a fair amount of clusters with >250 and <2500 OSDs; plotting
them all in one graph isn't pretty, imho. Maybe only pick the 5% best
and 5% worst OSDs?

> Also, higher latency values suggest bad performance. We can come up
> with specific values for both counters, above which we can say that
> the cluster is performing very badly. If the value of any of the OSDs
> exceeds this threshold, we can highlight the entire graph in a light
> red shade to draw the user's attention to it.

I'd say that anything >10ms is already pretty bad.
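To make that concrete, here is a rough Python sketch of mean vs. median,
dropping the 5% fastest and 5% slowest OSDs before plotting, and flagging
anything above a 10ms threshold. The per-OSD numbers are made up, and how
the dashboard would actually collect the counters is left out:

# Sketch only: per-OSD latency samples in milliseconds (made-up values).
from statistics import mean, median

latencies_ms = {0: 12.4, 1: 0.0, 2: 3.1, 3: 0.0, 4: 5.7, 5: 4.9}  # osd id -> ms

values = sorted(latencies_ms.values())
print("mean:  ", mean(values))    # dragged down by the idle 0ms OSDs
print("median:", median(values))  # much less sensitive to idle OSDs

# Drop the 5% fastest and 5% slowest OSDs before plotting
# (at least one from each end in this tiny example).
cut = max(1, len(values) // 20)
plotted = values[cut:-cut] if len(values) > 2 * cut else values
print("plotted:", plotted)

# Flag OSDs above a (configurable) 10ms threshold.
THRESHOLD_MS = 10.0
slow = [osd for osd, ms in latencies_ms.items() if ms > THRESHOLD_MS]
if slow:
    print("warning: OSDs above %.0fms: %s" % (THRESHOLD_MS, slow))

Whatever charting plugin ends up being used could then be fed the trimmed
series, with the red highlight toggled by the threshold check.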
> I am planning to use AJAX-based templates and plugins (like
> Flotcharts) for these graphs. This would allow real-time updates of
> the graphs without needing to reload the entire dashboard page.
>
> Another feature I propose to add is a representation of the version
> distribution of all the clients in a cluster. This can be categorised
> into a distribution
> 1. on the basis of ceph version
> [https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
> 2. on the basis of kernel version
> [https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]
>
> I have used doughnut charts instead of regular pie charts, as they
> have some whitespace at their centre. This whitespace makes the chart
> appear less cluttered, while still indicating each slice's fraction
> of the total. Also, we can later display some data in this centre
> space when hovering over a particular slice of the chart.
>
> The main purpose of this visualisation is to identify clients left
> behind while updating the clients of a cluster. Suppose a cluster has
> 50 clients running ceph jewel. In the process of updating this
> cluster, 40 clients get updated to ceph luminous, while the other 10
> remain behind on ceph jewel. This may happen because of a bug or an
> interruption in the update process. In such scenarios, the user can
> find which clients have not been updated and update them as needed.
> It may also give a clearer picture when troubleshooting package
> dependency issues caused by the kernel. The clients are represented
> both in absolute numbers and as a percentage of the entire cluster,
> for a better overview.
>
> An interesting approach could be to highlight the older version(s)
> specifically to grab the user's attention. For example, a user
> running ceph jewel may not need to update as urgently as a user
> running ceph hammer.
>
> As of now, I am looking for plugins in AdminLTE to implement these
> two elements in the dashboard. I would like feedback and suggestions
> from the ceph community on how I can make these two more informative
> about the cluster.
>
> Also, a request to the various ceph users and developers: it would be
> great if you could share the metrics you are using as performance
> indicators for your cluster, and how you are using them. Any metrics
> used to identify issues in a cluster can also be shared.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
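On the version doughnut charts: a small Python sketch of how the reported
version strings could be bucketed into counts and percentages for such a
chart. The version list below is invented (it mirrors the 40/10
jewel-to-luminous example above); in practice it would come from the
cluster's client/session information:

# Sketch only: bucket client version strings for a doughnut chart.
from collections import Counter

client_versions = (["12.2.0 (luminous)"] * 40 +
                   ["10.2.7 (jewel)"] * 10)   # invented sample data

counts = Counter(client_versions)
total = sum(counts.values())
for version, n in counts.most_common():
    print("%-20s %3d clients (%.0f%%)" % (version, n, 100.0 * n / total))

The same bucketing would work for kernel versions, and the oldest release
in the result could be the slice that gets highlighted.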