Re: Debugging 'slow requests' ...

Try capturing another log with debug_ms turned up. 1 or 5 should be OK
to start with.
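
For example (a minimal sketch; osd.11 below is just a placeholder, adjust the
daemon and level as needed), the messenger debug level can be raised at
runtime with either of:

    ceph tell osd.11 injectargs '--debug_ms 5'
    ceph daemon osd.11 config set debug_ms 5

and dropped back to 0 once you have a log covering one of the slow requests.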

On Fri, Feb 8, 2019 at 8:37 PM Massimo Sgaravatto
<massimo.sgaravatto@xxxxxxxxx> wrote:
>
> Our Luminous Ceph cluster had been working without problems for a while, but in the last few days we have been suffering from continuous slow requests.
>
> We have indeed done some changes in the infrastructure recently:
>
> - Moved OSD nodes to a new switch
> - Increased the PG count for a pool, to get to roughly 100 PGs/OSD (also because we have to install new OSDs in the cluster). The output of 'ceph osd df' is attached.
>
> The problem could also be due to some 'bad' client, but in the logs I don't see a clear correlation between such blocked requests and specific clients or images.
>
> I also tried updating to the latest Luminous release and the latest CentOS 7, but this didn't help.
>
>
>
> Attached you can find the details of one such slow operation, which took about 266 seconds (output from 'ceph daemon osd.11 dump_historic_ops').
> As far as I can understand from these events:
>                     {
>                         "time": "2019-02-08 10:26:25.651728",
>                         "event": "op_commit"
>                     },
>                     {
>                         "time": "2019-02-08 10:26:25.651965",
>                         "event": "op_applied"
>                     },
>
>                     {
>                         "time": "2019-02-08 10:26:25.653236",
>                         "event": "sub_op_commit_rec from 33"
>                     },
>                     {
>                         "time": "2019-02-08 10:30:51.890404",
>                         "event": "sub_op_commit_rec from 23"
>                     },
>
> the problem seems to be with the "sub_op_commit_rec from 23" event, which took far too long.
> So the issue is that the reply from OSD 23 took too long?
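>
> (For reference, the per-op durations can be pulled straight out of that output with something like the line below; this assumes jq is available on the node and the Luminous dump_historic_ops JSON layout:)
>
>     ceph daemon osd.11 dump_historic_ops | jq '.ops[] | {description, duration}'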
>
>
> In the logs of the two OSDs (11 and 23) in that time frame (attached) I can't find anything useful.
> When the problem happened, the load and memory usage on the relevant nodes were not high.
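>
> (For anyone wanting to reproduce: the relevant time window can be pulled from the OSD logs with something like the line below, assuming the default log location:)
>
>     grep -E '2019-02-08 10:(2[6-9]|30)' /var/log/ceph/ceph-osd.23.log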
>
>
> Any help to debug the issue is really appreciated ! :-)
>
> Thanks, Massimo
>
-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


