Re: Ideas on the UI/UX improvement of ceph-mgr: Cluster Status Dashboard

Brady Deetz <bdeetz@xxxxxxxxx> · Mon, 26 Jun 2017 08:52:21 -0500

+1 on SMART tracking

On Mon, Jun 26, 2017 at 5:19 AM, Massimiliano Cuttini <max@xxxxxxxxxxxxx> wrote:
Hi Saumay,

I think you should take into account tracking SMART data on every SSD found.

If a drive has SMART capabilities, then track its tests (or run tests) and display the values on the dashboard (or on a separate graph).

This allows admins to forecast which OSD will die next.

Prevention is better than restoration! :)

On 26/06/2017 06:49, saumay agrawal wrote:

Hi everyone!

I am working on improving the web-based dashboard for Ceph. My
intention is to add UI elements that visualise some of the performance
counters of a Ceph cluster. This gives dashboard users a better
overview of how the cluster is performing and, if necessary, where
they can optimise to get even better performance from it.

Here is my suggestion for two perf counters, commit latency and apply
latency. They are visualised using line graphs, and I have prepared UI
mockups for both.

1. OSD apply latency
   [https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
2. OSD commit latency
   [https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]

These mockups plot latency (y-axis) against time (x-axis). Each OSD's
latency is drawn in a different colour, and the average latency across
all OSDs is shown in red. This representation lets the dashboard user
compare the performance of one OSD with that of other OSDs, as well as
with the average performance of the cluster.

The lines in these graphs are deliberately kept thin, so that the
graph stays crisp and clear as the number of OSDs grows. However, this
approach may still clutter the graph and make it incomprehensible for
a cluster with a significantly higher number of OSDs. In such
situations, we can keep only the average latency line in both graphs
to make things simpler for the dashboard user.
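As a rough sketch of that fallback, the following Python picks which
lines to draw. The function name, the input layout (one list of
time-aligned samples per OSD) and the cut-off of 20 lines are
assumptions made for illustration, not part of the existing dashboard
code.

```python
from statistics import mean

# Hypothetical cut-off: above this many OSDs, draw only the average line.
MAX_OSD_LINES = 20


def series_to_plot(latency_by_osd, max_lines=MAX_OSD_LINES):
    """Return the named series a latency graph should draw.

    latency_by_osd maps an OSD id to a list of latency samples (ms).
    All lists are assumed to have the same length and to be aligned
    on the same time instants.
    """
    # Average across OSDs at each time instant.
    average = [mean(samples) for samples in zip(*latency_by_osd.values())]
    series = {"average": average}

    # Per-OSD lines are added only while the graph stays readable.
    if len(latency_by_osd) <= max_lines:
        for osd_id, samples in latency_by_osd.items():
            series[f"osd.{osd_id}"] = list(samples)
    return series
```

The average line is always present, so the graph degrades gracefully
rather than switching to a different view for large clusters.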

Also, higher latency values indicate worse performance. We can come up
with a specific threshold for each counter, above which we can say
that the cluster is performing very badly. If the latency of any OSD
exceeds this threshold, we can highlight the entire graph in a light
red shade to draw the user's attention to it.
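The threshold check could look something like the Python below. This
is only a sketch: the function names, the background labels and the
threshold value passed in are hypothetical, and sensible thresholds
for each counter still have to be decided.

```python
def osds_over_threshold(latest_latency_ms, threshold_ms):
    """Return the sorted ids of OSDs whose latest latency exceeds the threshold.

    latest_latency_ms maps an OSD id to its most recent latency sample (ms).
    """
    return sorted(
        osd_id
        for osd_id, value in latest_latency_ms.items()
        if value > threshold_ms
    )


def graph_background(latest_latency_ms, threshold_ms):
    """Pick the graph background: light red if any OSD is over the threshold."""
    if osds_over_threshold(latest_latency_ms, threshold_ms):
        return "light-red"
    return "default"
```

Returning the offending OSD ids as well as the colour would let the
dashboard say which OSDs triggered the highlight.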

I am planning to use AJAX-based templates and plugins (like
Flotcharts) for these graphs. This would allow the graphs to update in
real time without reloading the entire dashboard page.

Another feature I propose to add is a representation of the version
distribution of all the clients in a cluster. This can be broken down
into a distribution
1. on the basis of ceph version
   [https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
2. on the basis of kernel version
   [https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]

I have used doughnut charts instead of regular pie charts, as they
have some whitespace at their centre. This whitespace makes the chart
appear less cluttered, while still showing each slice's fraction of
the total. Also, we can later use this centre space to display some
data when the user hovers over a particular slice of the chart.

The main purpose of this visualisation is to identify any clients left
behind while updating the clients of the cluster. Suppose a cluster
has 50 clients running ceph jewel. In the process of updating this
cluster, 40 clients get updated to ceph luminous, while the other 10
clients remain behind on ceph jewel. This may occur due to some bug or
an interruption in the update process. In such scenarios, the user can
find which clients have not been updated and update them as needed. It
may also give a clearer picture when troubleshooting package
dependency issues caused by the kernel. The clients are shown both in
absolute numbers and as a percentage of the entire cluster, for a
better overview.
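To make the numbers behind the doughnut chart concrete, here is a
minimal Python sketch using the 50-client example above. The function
name and the input (a plain list with one version string per client)
are assumptions for illustration; how the dashboard actually obtains
client versions is not covered here.

```python
from collections import Counter


def version_distribution(client_versions):
    """Return {version: (count, percentage)} for a list of client versions.

    client_versions holds one version string per client; the percentage
    is relative to the total number of clients, rounded to one decimal.
    """
    counts = Counter(client_versions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        version: (count, round(100.0 * count / total, 1))
        for version, count in counts.items()
    }


# The example from the text: 40 clients updated, 10 left behind.
clients = ["luminous"] * 40 + ["jewel"] * 10
print(version_distribution(clients))
# {'luminous': (40, 80.0), 'jewel': (10, 20.0)}
```

The same function works for the kernel-version chart if it is given
kernel version strings instead.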

An interesting approach could be to highlight the older version(s)
specifically, to grab the user's attention. For example, a user
running ceph jewel may not need to update as urgently as a user
running ceph hammer.

As of now, I am looking for plugins in AdminLTE to implement these two
elements in the dashboard. I would like feedback and suggestions from
the ceph community on these two elements, and on how I can make them
more informative about the cluster.

I also have a request for ceph users and developers. It would be great
if you could share the metrics you use as performance indicators for
your cluster, and how you use them. Any metrics you use to identify
issues in a cluster are welcome too.

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
