in retrospect get OSD for "slow requests are blocked" ? / get detailed health status via librados?

Uwe Sauter <uwe.sauter.de@xxxxxxxxx> · Wed, 16 May 2018 10:16:05 +0200

Hi folks,

I'm currently chewing on an issue regarding "slow requests are blocked". I'd like to identify the OSD that is causing those events
once the cluster is back to HEALTH_OK (as I have no monitoring yet that would get this info in realtime).

Collecting this information could help identify aging disks if you were able to accumulate and analyze which OSD had blocking
requests in the past and how often those events occur.

My research so far leads me to think that this information is only available as long as the requests are actually blocked. Is this
correct?

MON logs only show that those events occur and how many requests are in a blocked state, but give no indication of which OSD is
affected. Is there a way to identify blocking requests from the OSD log files?

On a side note: I was trying to write a small Python script that would extract this kind of information in realtime. I was able
to register a MonitorLog callback that receives the same messages you would get with "ceph -w", but I haven't found anything in
the librados Python bindings documentation that does the equivalent of "ceph health detail". Any suggestions on how to
get the blocking OSDs via librados?
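
To make the question concrete, here is a rough sketch of what I have in mind. It assumes that Rados.mon_command() in the Python
bindings accepts the same JSON command the ceph CLI sends for "ceph health detail", and that the JSON reply has a Luminous-style
"checks" map whose entries carry a "detail" list of messages. The check code "REQUEST_SLOW", the helper names and the config
path are assumptions on my side and are not taken from the documentation.

```python
import json


def health_detail_command():
    """Build the JSON command assumed to match 'ceph health detail'."""
    return json.dumps({"prefix": "health", "detail": "detail", "format": "json"})


def check_details(health, code):
    """Return the detail messages of one health check, or [] if it is absent.

    Assumes a reply shaped like:
    {"checks": {CODE: {"detail": [{"message": "..."}]}}}
    """
    check = health.get("checks", {}).get(code, {})
    return [entry.get("message", "") for entry in check.get("detail", [])]


def fetch_health(conffile="/etc/ceph/ceph.conf"):
    """Ask the monitors for the detailed health status and decode the reply."""
    import rados  # imported here so the helpers above work without the bindings

    cluster = rados.Rados(conffile=conffile)
    cluster.connect()
    try:
        # mon_command returns (return code, output buffer, status string)
        ret, outbuf, outs = cluster.mon_command(health_detail_command(), b"")
        if ret != 0:
            raise RuntimeError("health command failed: %s" % outs)
        return json.loads(outbuf.decode("utf-8"))
    finally:
        cluster.shutdown()
```

Calling check_details(fetch_health(), "REQUEST_SLOW") while requests are blocked should then return the detail lines, which
would still have to be parsed for the OSD names. I have not verified the exact wording of those lines.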

Thanks,

	Uwe
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com