Hello ceph-users and ceph-devel list,

we have gone into production with our shiny new Luminous (12.2.5) cluster.
This cluster runs SSD- and HDD-based OSD pools.

To ensure the service quality of the cluster and to have a baseline for client latency optimization (for example in the area of deep-scrub optimization), we would like to have statistics about the client interaction latency of our cluster.

Which measures are suitable for getting such an "aggregated by device_class" average latency KPI?
A percentile rank would also be great (percentage of requests serviced in < 5 ms, in < 20 ms, in < 50 ms, ...).

The following command provides an overview of the commit latency of the OSDs, but no average latency and no information about the device_class:

    ceph osd perf -f json-pretty

    {
        "osd_perf_infos": [
            {
                "id": 71,
                "perf_stats": {
                    "commit_latency_ms": 2,
                    "apply_latency_ms": 0
                }
            },
            {
                "id": 70,
                "perf_stats": {
                    "commit_latency_ms": 3,
                    "apply_latency_ms": 0
                }

Device class information can be extracted from "ceph df -f json-pretty".

But building averages of averages does not seem to be a good thing .... :-)

It seems that I can get more detailed information using the "ceph daemon osd.<nr> perf histogram dump" command.
This seems to deliver the percentile rank information at a good level of detail.
(http://docs.ceph.com/docs/luminous/dev/perf_histograms/)

My questions:

Are there tools to analyze and aggregate these measures for a group of OSDs?

Which measures should I use as a baseline for client latency optimization?

What is the time horizon of these measures?

I sometimes see messages like the following in my log. They seem to be caused by deep scrubbing. How can I find the source of, and a solution to, this problem?

2018-07-11 16:58:55.064497 mon.ceph-mon-s43 [INF] Cluster is now healthy
2018-07-11 16:59:15.141214 mon.ceph-mon-s43 [WRN] Health check failed: 4 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:25.037707 mon.ceph-mon-s43 [WRN] Health check update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:30.038001 mon.ceph-mon-s43 [WRN] Health check update: 23 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:35.210900 mon.ceph-mon-s43 [WRN] Health check update: 27 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:45.038718 mon.ceph-mon-s43 [WRN] Health check update: 29 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:50.038955 mon.ceph-mon-s43 [WRN] Health check update: 39 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 16:59:55.281279 mon.ceph-mon-s43 [WRN] Health check update: 44 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 17:00:00.000121 mon.ceph-mon-s43 [WRN] overall HEALTH_WARN 12 slow requests are blocked > 32 sec
2018-07-11 17:00:05.039677 mon.ceph-mon-s43 [WRN] Health check update: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-07-11 17:00:09.329897 mon.ceph-mon-s43 [INF] Health check cleared: REQUEST_SLOW (was: 12 slow requests are blocked > 32 sec)
2018-07-11 17:00:09.329919 mon.ceph-mon-s43 [INF] Cluster is now healthy

Regards
Marc
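
For illustration, a minimal sketch of the per-device_class aggregation asked about above. It is not an existing Ceph tool and has not been checked against a live cluster. It assumes that each OSD's "ceph daemon osd.<nr> perf dump" output has already been loaded as a dict containing an "op_latency" counter with "avgcount" and "sum" fields under "osd", and that the OSD-to-device_class mapping has been collected separately. The histogram input (a flat list of bucket counts per OSD sharing one list of upper bounds in ms) is a hypothetical simplification of the histogram dump, and all function names are made up for this example. Summing the latency sums and op counts before dividing avoids the averages-of-averages problem.

```python
from collections import defaultdict


def pooled_mean_latency(perf_dumps, device_class):
    """Return {device_class: mean op latency}, weighted by op count.

    perf_dumps:   {osd_id: dict shaped like {"osd": {"op_latency":
                   {"avgcount": int, "sum": float}}}}  (assumed shape)
    device_class: {osd_id: "ssd" | "hdd" | ...}
    """
    totals = defaultdict(lambda: [0.0, 0])  # class -> [latency sum, op count]
    for osd_id, dump in perf_dumps.items():
        counter = dump["osd"]["op_latency"]
        acc = totals[device_class[osd_id]]
        acc[0] += counter["sum"]
        acc[1] += counter["avgcount"]
    # Divide once per class; a class without any ops has no defined mean.
    return {cls: (s / n if n else None) for cls, (s, n) in totals.items()}


def merge_histograms(histograms, device_class):
    """Sum per-OSD bucket counts element-wise within each device class.

    histograms: {osd_id: [count_bucket0, count_bucket1, ...]}
                (hypothetical flattened form, same buckets on every OSD)
    """
    merged = {}
    for osd_id, counts in histograms.items():
        cls = device_class[osd_id]
        if cls not in merged:
            merged[cls] = list(counts)
        else:
            merged[cls] = [a + b for a, b in zip(merged[cls], counts)]
    return merged


def percent_below(counts, upper_bounds_ms, threshold_ms):
    """Percentage of requests in buckets whose upper bound is <= threshold_ms."""
    total = sum(counts)
    if total == 0:
        return None
    below = sum(c for c, ub in zip(counts, upper_bounds_ms) if ub <= threshold_ms)
    return 100.0 * below / total


if __name__ == "__main__":
    # Made-up numbers, for demonstration only.
    device_class = {70: "ssd", 71: "ssd", 12: "hdd"}
    perf_dumps = {
        70: {"osd": {"op_latency": {"avgcount": 100, "sum": 0.2}}},
        71: {"osd": {"op_latency": {"avgcount": 300, "sum": 1.8}}},
        12: {"osd": {"op_latency": {"avgcount": 50, "sum": 1.0}}},
    }
    print(pooled_mean_latency(perf_dumps, device_class))

    bounds_ms = [5, 20, 50]
    histograms = {70: [80, 15, 5], 71: [200, 80, 20], 12: [10, 25, 15]}
    for cls, counts in merge_histograms(histograms, device_class).items():
        print(cls, [percent_below(counts, bounds_ms, t) for t in bounds_ms])
```

The percentile rank is only as fine-grained as the bucket boundaries, so a threshold that falls inside a bucket is effectively rounded down to the previous boundary.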