> On 26 June 2017 at 6:49, saumay agrawal <saumay.agrawal@xxxxxxxxx> wrote:
>
> Hi everyone!
>
> I am working on the improvement of the web-based dashboard for Ceph.
> My intention is to add some UI elements to visualise some performance
> counters of a Ceph cluster. This gives dashboard users a better
> overview of how the Ceph cluster is performing and, if necessary,
> where they can optimise to get even better performance from the
> cluster.
>
> Here is my suggestion for two perf counters, commit latency and apply
> latency. They are visualised using line graphs. I have prepared UI
> mockups for both:
> 1. OSD apply latency
> [https://drive.google.com/open?id=0ByXy5gIBzlhYNS1MbTJJRDhtSG8]
> 2. OSD commit latency
> [https://drive.google.com/open?id=0ByXy5gIBzlhYNElyVU00TGtHeVU]
>
> These mockups plot the latency values (y-axis) against time (x-axis).
> The latency values for different OSDs are highlighted using different
> colours. The average latency of all OSDs is shown specifically in
> red. This representation allows the dashboard user to compare the
> performance of an OSD with that of other OSDs, as well as with the
> average performance of the cluster.

Is avg really the best way to go? Most big clusters (100s or 1000s of
OSDs) always have a few OSDs idle, so they will report a latency of
0ms. A 0 brings down an average pretty fast. Isn't something like the
median a better way to go?

> The line width in these graphs is deliberately kept small, so as to
> give a crisp and clear representation for a larger number of OSDs.
> However, this approach may clutter the graph and make it
> incomprehensible for a cluster with a significantly higher number of
> OSDs. For such situations, we can retain only the average latency
> indications in both graphs to keep things simpler for the dashboard
> user.

I see a fair amount of clusters with >250 and <2500 OSDs; plotting
them all in one graph isn't pretty, imho. Maybe only pick the 5% best
and 5% worst OSDs?

> Also, higher latency values suggest bad performance. We can come up
> with specific values for both counters, above which we can say that
> the cluster is performing very badly. If the value of any of the OSDs
> exceeds this threshold, we can highlight the entire graph in a light
> red shade to draw the user's attention to it.

I'd say that anything >10ms is already pretty bad.
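To make that concrete, here is a rough Python sketch of mean vs. median,
dropping the 5% fastest and 5% slowest OSDs before plotting, and flagging
anything above a 10ms threshold. The per-OSD numbers are made up, and how
the dashboard would actually collect the counters is left out:

# Sketch only: per-OSD latency samples in milliseconds (made-up values).
from statistics import mean, median

latencies_ms = {0: 12.4, 1: 0.0, 2: 3.1, 3: 0.0, 4: 5.7, 5: 4.9}  # osd id -> ms

values = sorted(latencies_ms.values())
print("mean:  ", mean(values))    # dragged down by the idle 0ms OSDs
print("median:", median(values))  # much less sensitive to idle OSDs

# Drop the 5% fastest and 5% slowest OSDs before plotting
# (at least one from each end in this tiny example).
cut = max(1, len(values) // 20)
plotted = values[cut:-cut] if len(values) > 2 * cut else values
print("plotted:", plotted)

# Flag OSDs above a (configurable) 10ms threshold.
THRESHOLD_MS = 10.0
slow = [osd for osd, ms in latencies_ms.items() if ms > THRESHOLD_MS]
if slow:
    print("warning: OSDs above %.0fms: %s" % (THRESHOLD_MS, slow))

Whatever charting plugin ends up being used could then be fed the trimmed
series, with the red highlight toggled by the threshold check.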
> I am planning to use AJAX-based templates and plugins (like
> Flotcharts) for these graphs. This would allow real-time updates of
> the graphs without needing to reload the entire dashboard page.
>
> Another feature I propose to add is a representation of the version
> distribution of all the clients in a cluster. This can be categorised
> into a distribution
> 1. on the basis of ceph version
> [https://drive.google.com/open?id=0ByXy5gIBzlhYYmw5cXF2bkdTWWM] and,
> 2. on the basis of kernel version
> [https://drive.google.com/open?id=0ByXy5gIBzlhYczFuRTBTRDcwcnc]
>
> I have used doughnut charts instead of regular pie charts, as they
> have some whitespace at their centre. This whitespace makes the chart
> appear less cluttered, while still indicating each slice's fraction
> of the total. Also, we can later display some data in this centre
> space when hovering over a particular slice of the chart.
>
> The main purpose of this visualisation is to identify clients left
> behind while updating the clients of a cluster. Suppose a cluster has
> 50 clients running ceph jewel. In the process of updating this
> cluster, 40 clients get updated to ceph luminous, while the other 10
> remain behind on ceph jewel. This may happen because of a bug or an
> interruption in the update process. In such scenarios, the user can
> find which clients have not been updated and update them as needed.
> It may also give a clearer picture when troubleshooting package
> dependency issues caused by the kernel. The clients are represented
> both in absolute numbers and as a percentage of the entire cluster,
> for a better overview.
>
> An interesting approach could be to highlight the older version(s)
> specifically to grab the user's attention. For example, a user
> running ceph jewel may not need to update as urgently as a user
> running ceph hammer.
>
> As of now, I am looking for plugins in AdminLTE to implement these
> two elements in the dashboard. I would like feedback and suggestions
> from the ceph community on how I can make these two more informative
> about the cluster.
>
> Also, a request to the various ceph users and developers: it would be
> great if you could share the metrics you are using as performance
> indicators for your cluster, and how you are using them. Any metrics
> used to identify issues in a cluster can also be shared.
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
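On the version doughnut charts: a small Python sketch of how the reported
version strings could be bucketed into counts and percentages for such a
chart. The version list below is invented (it mirrors the 40/10
jewel-to-luminous example above); in practice it would come from the
cluster's client/session information:

# Sketch only: bucket client version strings for a doughnut chart.
from collections import Counter

client_versions = (["12.2.0 (luminous)"] * 40 +
                   ["10.2.7 (jewel)"] * 10)   # invented sample data

counts = Counter(client_versions)
total = sum(counts.values())
for version, n in counts.most_common():
    print("%-20s %3d clients (%.0f%%)" % (version, n, 100.0 * n / total))

The same bucketing would work for kernel versions, and the oldest release
in the result could be the slice that gets highlighted.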