Turn up debug logging, at least debug_osd 20, and search for the
operation in the OSD logs.

On Thu, Jun 14, 2018 at 5:38 PM, Frank (lists) <lists@xxxxxxxxxxx> wrote:
> Hi,
>
> On a small cluster (3 nodes) I frequently have slow requests. When dumping
> the inflight ops from the hanging OSD, it seems it doesn't get a 'response'
> for one of the subops. The events always look like:
>
>     "events": [
>         {
>             "time": "2018-06-14 07:10:07.256196",
>             "event": "initiated"
>         },
>         {
>             "time": "2018-06-14 07:10:07.256671",
>             "event": "queued_for_pg"
>         },
>         {
>             "time": "2018-06-14 07:10:07.256745",
>             "event": "reached_pg"
>         },
>         {
>             "time": "2018-06-14 07:10:07.256826",
>             "event": "started"
>         },
>         {
>             "time": "2018-06-14 07:10:07.256924",
>             "event": "waiting for subops from 18,20"
>         },
>         {
>             "time": "2018-06-14 07:10:07.263769",
>             "event": "op_commit"
>         },
>         {
>             "time": "2018-06-14 07:10:07.263775",
>             "event": "op_applied"
>         },
>         {
>             "time": "2018-06-14 07:10:07.269989",
>             "event": "sub_op_commit_rec from 18"
>         }
>     ]
>
> The OSD IDs are not the same each time. Looking at osd.20, the OSD process
> runs and it accepts requests ('ceph tell osd.20 bench' runs fine). When I
> restart the process for the OSD, the request is completed.
> I could not find any pattern in which OSD is to blame (always another one)
> or which of the servers; that also differs.
>
> The cluster runs Ceph 7.5 with 'ceph version 12.2.5
> (cad919881333ac92274171586c827e01f554a70a) luminous (stable)'. It's just a
> test cluster with very little activity. What could be a cause of a
> (replica) OSD not replying?
>
> Regards,
>
> Frank de Bot
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Cheers,
Brad
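
Before digging through the debug logs, it can help to confirm mechanically which replica never acknowledged the write, so you know which OSD's log to search. Below is a minimal Python sketch that works only on an "events" list shaped like the one quoted in this thread. The function name pending_subops is made up for illustration, and the event strings are assumed to match the Luminous output shown above; other releases may word them differently.

```python
import re

def pending_subops(events):
    """Return the OSD ids that were waited on but never sent sub_op_commit_rec.

    `events` is a list of {"time": ..., "event": ...} dicts, as in the
    inflight-ops dump quoted in this thread (format assumed, not guaranteed).
    """
    waiting = set()
    replied = set()
    for ev in events:
        text = ev["event"]
        m = re.match(r"waiting for subops from ([\d,]+)", text)
        if m:
            # e.g. "waiting for subops from 18,20" -> {18, 20}
            waiting.update(int(x) for x in m.group(1).split(",") if x)
            continue
        m = re.match(r"sub_op_commit_rec from (\d+)", text)
        if m:
            replied.add(int(m.group(1)))
    return waiting - replied

# Events copied from the original post.
events = [
    {"time": "2018-06-14 07:10:07.256196", "event": "initiated"},
    {"time": "2018-06-14 07:10:07.256671", "event": "queued_for_pg"},
    {"time": "2018-06-14 07:10:07.256745", "event": "reached_pg"},
    {"time": "2018-06-14 07:10:07.256826", "event": "started"},
    {"time": "2018-06-14 07:10:07.256924", "event": "waiting for subops from 18,20"},
    {"time": "2018-06-14 07:10:07.263769", "event": "op_commit"},
    {"time": "2018-06-14 07:10:07.263775", "event": "op_applied"},
    {"time": "2018-06-14 07:10:07.269989", "event": "sub_op_commit_rec from 18"},
]

print(sorted(pending_subops(events)))  # [20]
```

For the dump in the original post this reports osd.20 as the replica that has not replied, so that is the OSD whose log is worth searching at debug_osd 20, alongside the primary's.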