Re: Frequent slow requests

Brad Hubbard <bhubbard@xxxxxxxxxx> · Fri, 15 Jun 2018 07:02:02 +1000

Turn up debug logging, at least debug_osd 20, and search for the
operation in the osd logs.
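
For example, debug_osd can be raised at runtime with something like
`ceph tell osd.20 injectargs '--debug_osd 20'` (turn it back down
afterwards, the logs grow quickly). Once a slow request recurs, a small
filter along these lines pulls the matching lines out of the logs. This is
only a sketch: the helper name is mine, the log path in the example is the
assumed default location, and the request id is a made-up placeholder, so
substitute whatever the op description in your inflight dump shows.

```python
def find_in_logs(paths, needle):
    """Return (path, line_number, line) for every log line containing needle."""
    hits = []
    for path in paths:
        # errors="replace" keeps the scan going if a log contains odd bytes
        with open(path, errors="replace") as log:
            for number, line in enumerate(log, start=1):
                if needle in line:
                    hits.append((path, number, line.rstrip("\n")))
    return hits

# Example call (assumed default log location, hypothetical request id):
# for hit in find_in_logs(["/var/log/ceph/ceph-osd.20.log"], "client.4123.0:1234"):
#     print(*hit)
```

It is worth running this against the primary's log and the logs of both
replicas, since the question is whether the replica ever saw the subop and
whether its reply ever reached the primary.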

On Thu, Jun 14, 2018 at 5:38 PM, Frank (lists) <lists@xxxxxxxxxxx> wrote:
> Hi,
>
> On a small cluster (3 nodes) I frequently have slow requests. When dumping
> the inflight ops from the hanging OSD, it seems it doesn't get a 'response'
> for one of the subops. The events always look like:
>
>                 "events": [
>                     {
>                         "time": "2018-06-14 07:10:07.256196",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.256671",
>                         "event": "queued_for_pg"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.256745",
>                         "event": "reached_pg"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.256826",
>                         "event": "started"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.256924",
>                         "event": "waiting for subops from 18,20"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.263769",
>                         "event": "op_commit"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.263775",
>                         "event": "op_applied"
>                     },
>                     {
>                         "time": "2018-06-14 07:10:07.269989",
>                         "event": "sub_op_commit_rec from 18"
>                     }
>                  ]
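
To spot which replica never answered without reading each dump by eye, the
event list above can be checked mechanically. A rough sketch, assuming the
event strings look exactly like the ones quoted here (they may differ in
other Ceph releases); the function name is hypothetical and the timestamps
are left out because only the event text matters:

```python
import re

def pending_subops(events):
    """Return replica OSD ids that were waited on but never reported a commit."""
    waiting, committed = set(), set()
    for ev in events:
        text = ev["event"]
        m = re.match(r"waiting for subops from ([\d,]+)$", text)
        if m:
            waiting.update(int(x) for x in m.group(1).split(","))
            continue
        m = re.match(r"sub_op_commit_rec from (\d+)$", text)
        if m:
            committed.add(int(m.group(1)))
    return sorted(waiting - committed)

# Event texts copied from the dump quoted above.
events = [
    {"event": "initiated"},
    {"event": "queued_for_pg"},
    {"event": "reached_pg"},
    {"event": "started"},
    {"event": "waiting for subops from 18,20"},
    {"event": "op_commit"},
    {"event": "op_applied"},
    {"event": "sub_op_commit_rec from 18"},
]

print(pending_subops(events))  # [20]
```

For this dump the result is osd.20, so that is the OSD whose log is worth
reading at debug_osd 20 for this particular request.
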
>
> The OSD IDs are not always the same. Looking at osd.20, the OSD process is
> running and accepts requests ('ceph tell osd.20 bench' runs fine). When I
> restart the OSD process, the request completes.
> I could not find any pattern in which OSD is to blame (it is a different
> one each time) or in which server it runs on; that varies as well.
>
> The cluster runs Ceph 7.5 with 'ceph version 12.2.5
> (cad919881333ac92274171586c827e01f554a70a) luminous (stable)'. It's just a
> test cluster with very little activity. What could cause a (replica) OSD
> not to reply?
>
> Regards,
>
> Frank de Bot
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Cheers,
Brad