- As far as I understand, the reported 'implicated osds' are only the primary ones. In the OSD logs you should also find the relevant PG number, and with that information you can identify all the involved OSDs. This can be useful, e.g., to see whether a specific OSD node is always involved. That was my case (the problem was with the patch cable connecting the node).
I can tell the implicated OSDs right from the REQUEST_SLOW error log lines, and therefore which nodes are involved. It is happening on all nodes in the cluster, without exception, so it cannot be linked to one specific node.
- You can use the "ceph daemon osd.x dump_historic_ops" command to debug some of these slow requests (to see which events take the most time):
2019-02-25 17:40:49.550303 > initiated
2019-02-25 17:40:49.550338 > queued_for_pg
2019-02-25 17:40:49.550924 > reached_pg
2019-02-25 17:40:49.550950 > started
2019-02-25 17:40:49.550989 > waiting for subops from 21,35
2019-02-25 17:40:49.552316 > op_commit
2019-02-25 17:40:49.552320 > op_applied
2019-02-25 17:40:49.553216 > sub_op_commit_rec from 21
2019-02-25 17:41:18.416662 > sub_op_commit_rec from 35
2019-02-25 17:41:18.416708 > commit_sent
2019-02-25 17:41:18.416726 > done
I'm not sure how to read this output - does each timestamp mark the start or the finish of the event? Does it mean that the op is waiting for OSD 21 or 35? I tried examining dump_historic_ops on a few different OSDs; they all seem to be waiting on other OSDs, but there is no pattern (the OSD numbers differ each time).
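[Editorial note: each timestamp in dump_historic_ops marks when that event occurred, so the gap between consecutive events shows where the time went; in the dump above, the op waited roughly 29 s between "sub_op_commit_rec from 21" and "sub_op_commit_rec from 35", i.e. on the replica commit from osd.35. A minimal sketch in plain Python, with the event list copied from the dump above, that computes the per-step deltas:]

```python
# Sketch: compute per-step latency from a dump_historic_ops-style event list.
# Each event's timestamp is when that event occurred, so the delta from the
# previous event is the time spent reaching it.
from datetime import datetime

# (timestamp, event name) pairs copied from the dump above.
events = [
    ("2019-02-25 17:40:49.550303", "initiated"),
    ("2019-02-25 17:40:49.550338", "queued_for_pg"),
    ("2019-02-25 17:40:49.550924", "reached_pg"),
    ("2019-02-25 17:40:49.550950", "started"),
    ("2019-02-25 17:40:49.550989", "waiting for subops from 21,35"),
    ("2019-02-25 17:40:49.552316", "op_commit"),
    ("2019-02-25 17:40:49.552320", "op_applied"),
    ("2019-02-25 17:40:49.553216", "sub_op_commit_rec from 21"),
    ("2019-02-25 17:41:18.416662", "sub_op_commit_rec from 35"),
    ("2019-02-25 17:41:18.416708", "commit_sent"),
    ("2019-02-25 17:41:18.416726", "done"),
]

fmt = "%Y-%m-%d %H:%M:%S.%f"
parsed = [(datetime.strptime(ts, fmt), name) for ts, name in events]

# Delta (in seconds) from the previous event to each event.
deltas = [
    (name, (t - parsed[i - 1][0]).total_seconds())
    for i, (t, name) in enumerate(parsed)
    if i > 0
]
slowest = max(deltas, key=lambda d: d[1])
print(f"slowest step: {slowest[0]} (+{slowest[1]:.3f}s)")
# Here the slowest step is "sub_op_commit_rec from 35" at about +28.863 s,
# i.e. the primary spent ~29 s waiting for osd.35 to acknowledge the sub-op.
```

[That the waiting OSDs differ each time is consistent with a shared cause (e.g. network) rather than one bad OSD.]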
Best,
Martin
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com