+1 on SMART tracking
On Mon, Jun 26, 2017 at 5:19 AM, Massimiliano Cuttini <max@xxxxxxxxxxxxx> wrote:
Hi Saumay,
I think you should consider tracking SMART data on every SSD found.
If a drive has SMART capabilities, then track its tests (or commit tests) and display its values on the dashboard (or on a separate graph).
This allows admins to forecast which OSD will die next.
Prevention is better than restoration! :)
On 26/06/2017 06:49, saumay agrawal wrote:
Hi everyone!
I am working on the improvement of the web-based dashboard for Ceph.
My intention is to add some UI elements to visualise some performance
counters of a Ceph cluster. This gives a better overview to the users
of the dashboard about how the Ceph cluster is performing and, if
necessary, where they can make necessary optimisations to get even
better performance from the cluster.
Here is my suggestion on the two perf counters, commit latency and
apply latency. They are visualised using line graphs. I have prepared
UI mockups for the same.
1. OSD apply latency
[https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
2. OSD commit latency
[https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]
These mockups show the latency values (y-axis) against the instant of
time (x-axis). The latency values for different OSDs are highlighted
using different colours. The average latency value of all OSDs is
shown specifically in red. This representation allows the dashboard
user to compare the performances of an OSD with other OSDs, as well as
with the average performance of the cluster.
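As a minimal sketch of how the red average line described above could be
derived, assuming the dashboard already holds one latency series per OSD,
all sampled at the same instants. The function name average_series and
the sample values are hypothetical and not taken from the dashboard code.

```python
from statistics import mean


def average_series(per_osd):
    """Return the per-instant mean latency across all OSDs.

    per_osd maps an OSD id to a list of latency samples (ms); every list
    is assumed to be sampled at the same instants and have equal length.
    """
    if not per_osd:
        return []
    series = list(per_osd.values())
    # zip(*series) groups the samples taken at the same instant.
    return [mean(samples) for samples in zip(*series)]


# Hypothetical apply-latency samples (ms) for three OSDs at three instants.
apply_latency_ms = {
    0: [2.0, 4.0, 6.0],
    1: [4.0, 4.0, 3.0],
    2: [6.0, 10.0, 3.0],
}
cluster_average = average_series(apply_latency_ms)
```

The resulting list could be plotted as one more series alongside the
per-OSD lines.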
The lines in these graphs are deliberately kept thin, so as to give a
crisp and clear representation for a larger number of OSDs. However, this
approach may clutter the graph and make it incomprehensible for a
cluster with a significantly higher number of OSDs. For such
situations, we can retain only the average latency line in
both graphs to keep things simple for the dashboard user.
Also, higher latency values suggest bad performance. We can come up
with a specific threshold for each counter, above which we can
say that the cluster is performing very badly. If the latency of any
OSD exceeds this threshold, we can highlight the entire graph in a light
red shade to draw the user's attention to it.
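The highlighting rule above could be sketched roughly as follows. The
function name osds_over_threshold, the sample latencies and the 100 ms
threshold are all placeholders chosen for illustration; no concrete
threshold has been proposed in this thread.

```python
def osds_over_threshold(latest_ms, threshold_ms):
    """Return the sorted ids of OSDs whose latest latency exceeds the threshold."""
    return sorted(osd for osd, value in latest_ms.items() if value > threshold_ms)


# Hypothetical latest commit latencies (ms), keyed by OSD id.
latest_commit_ms = {0: 3.5, 1: 120.0, 2: 8.1}
THRESHOLD_MS = 100.0  # assumption: placeholder value, not a recommendation

slow_osds = osds_over_threshold(latest_commit_ms, THRESHOLD_MS)
# Shade the whole graph as soon as at least one OSD is over the threshold.
highlight_graph = bool(slow_osds)
```

Returning the offending OSD ids, rather than only a boolean, would also
let the dashboard name the slow OSDs next to the shaded graph.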
I am planning to use AJAX-based templates and plugins (like
Flotcharts) for these graphs. This would allow real-time updates of the
graphs without any need to reload the entire dashboard page.
Another feature I propose to add is the representation of the version
distribution of all the clients in a cluster. This can be categorised
into distribution
1. on the basis of ceph version
[https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
2. on the basis of kernel version
[https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]
I have used doughnut charts instead of regular pie charts, as they
have some whitespace at their centre. This whitespace makes the chart
appear less cluttered, while properly indicating the appropriate
fraction of the total value. Also, we can later add some data to
display at this centre space when we hover over a particular slice of
the chart.
The main purpose of this visualisation is to identify any
clients left behind while updating the clients of the cluster. Suppose
a cluster has 50 clients running ceph jewel. In the process of
updating this cluster, 40 clients get updated to ceph luminous, while
the other 10 clients remain behind on ceph jewel. This may occur due
to some bug or an interruption in the update process. In such
scenarios, the user can find which clients have not been updated and
update them as needed. It may also give a clearer picture
when troubleshooting package dependency issues caused by the
kernel. The clients are shown both in absolute numbers and as
a percentage of the entire cluster, for a better overview.
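The numbers behind each doughnut slice could be computed along these
lines. This is only a sketch: it assumes the dashboard can obtain one
version string per client, and the function name version_distribution is
hypothetical.

```python
from collections import Counter


def version_distribution(client_versions):
    """Return {version: (count, percentage)} for a list of client versions."""
    counts = Counter(client_versions)
    total = sum(counts.values())
    return {
        version: (count, 100.0 * count / total)
        for version, count in counts.items()
    }


# The example from the text: 40 clients on luminous, 10 left on jewel.
clients = ["luminous"] * 40 + ["jewel"] * 10
distribution = version_distribution(clients)
```

The same function would work unchanged for the kernel version chart,
given a list of kernel version strings instead.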
An interesting approach could be to highlight the older version(s)
specifically to grab the attention of the user. For example, a user
running ceph jewel may not need to update as urgently as
a user running ceph hammer.
As of now, I am looking for plugins in AdminLTE to implement these two
elements in the dashboard. I would like to have feedback and
suggestions on these two from the ceph community, on how I can make
them more informative about the cluster.
I also have a request for ceph users and developers. It would be
great if you could share the various metrics you are using as
performance indicators for your cluster, and how you are using them.
Any metrics used to identify issues in a cluster can also be
shared.
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com