Re: Fwd: Slow requests troubleshooting in Luminous - details missing

On Mon, Mar 12, 2018 at 12:21 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
> On Mon, Mar 12, 2018 at 7:53 AM, Дробышевский, Владимир <vlad@xxxxxxxxxx> wrote:
>>
>> I was following this conversation on the tracker and have the same
>> question. I ran into a situation with slow requests and had no idea how
>> to find the cause. I eventually found it, but only because I knew I had
>> upgraded the Mellanox drivers on one host and decided to check the IB
>> config (the root cause was there: the adapter had switched into
>> datagram mode). Had that not been the reason, I would have been truly lost.
>>
>> On Mar 12, 2018 at 9:39, "Alex Gorbachev" <ag@xxxxxxxxxxxxxxxxxxx>
>> wrote:
>>
>>> On Mon, Mar 5, 2018 at 11:20 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
>>> > On Fri, Mar 2, 2018 at 3:54 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
>>> > wrote:
>>> >> On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonstein@xxxxxxxxx>
>>> >> wrote:
>>> >>> Blocked requests and slow requests are synonyms in Ceph. They are
>>> >>> two names
>>> >>> for the exact same thing.
>>> >>>
>>> >>>
>>> >>> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
>>> >>> wrote:
>>> >>>>
>>> >>>> On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonstein@xxxxxxxxx>
>>> >>>> wrote:
>>> >>>> > `ceph health detail` should show you more information about the
>>> >>>> > slow requests.  If the output is too much, you can grep it for
>>> >>>> > 'blocked' or similar.  It should tell you which OSDs are
>>> >>>> > involved, how long they've been slow, etc.  The default threshold
>>> >>>> > is '> 32 sec', but the actual delay may be much longer, and
>>> >>>> > `ceph health detail` will show that.
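>>> >>>> >
>>> >>>> > For example, roughly like this (the OSD number and the output
>>> >>>> > here are illustrative, not from a real cluster):
>>> >>>> >
>>> >>>> > # ceph health detail | grep -i -e slow -e blocked
>>> >>>> > HEALTH_WARN 5 slow requests are blocked > 32 sec
>>> >>>> > REQUEST_SLOW 5 slow requests are blocked > 32 sec
>>> >>>> >     3 ops are blocked > 65.536 sec
>>> >>>> >     2 ops are blocked > 32.768 sec
>>> >>>> >     osd.12 has blocked requests > 65.536 sec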
>>> >>>>
>>> >>>> Hi David,
>>> >>>>
>>> >>>> Thank you for the reply.  Unfortunately, the health detail only shows
>>> >>>> blocked requests.  This seems to be related to a compression setting
>>> >>>> on the pool, nothing in OSD logs.
>>> >>>>
>>> >>>> I replied to another compression thread.  This makes sense since
>>> >>>> compression is new, and in the past all such issues were reflected in
>>> >>>> OSD logs and related to either network or OSD hardware.
>>> >>>>
>>> >>>> Regards,
>>> >>>> Alex
>>> >>>>
>>> >>>> >
>>> >>>> > On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev
>>> >>>> > <ag@xxxxxxxxxxxxxxxxxxx>
>>> >>>> > wrote:
>>> >>>> >>
>>> >>>> >> Is there a switch to turn on the display of specific OSD issues?
>>> >>>> >> Or does the below indicate a generic problem, e.g. the network,
>>> >>>> >> rather than any specific OSD?
>>> >>>> >>
>>> >>>> >> 2018-02-28 18:09:36.438300 7f6dead56700  0
>>> >>>> >> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail
>>> >>>> >> 56%
>>> >>>> >> total 15997 MB, used 6154 MB, avail 9008 MB
>>> >>>> >> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster)
>>> >>>> >> log
>>> >>>> >> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
>>> >>>> >> (REQUEST_SLOW)
>>> >>>> >> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster)
>>> >>>> >> log
>>> >>>> >> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
>>> >>>> >> (REQUEST_SLOW)
>>> >>>> >> 2018-02-28 18:09:53.794882 7f6de8551700  0
>>> >>>> >> mon.roc-vm-sc3c234@0(leader) e1 handle_command
>>> >>>> >> mon_command({"prefix":
>>> >>>> >> "status", "format": "json"} v 0) v1
>>> >>>> >>
>>> >>>> >> --
>>> >>
>>> >> I was wrong: pool compression does not matter; an uncompressed pool
>>> >> also generates these slow-request messages.
>>> >>
>>> >> The question is why there is no subsequent message identifying the
>>> >> specific OSDs, as in Jewel and prior. For example, from RH:
>>> >>
>>> >> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
>>> >> 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>>> >>
>>> >> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
>>> >> old, received at {date-time}: osd_op(client.4240.0:8
>>> >> benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
>>> >> currently waiting for subops from [610]
>>> >>
>>> >> In comparison, my Luminous cluster only shows the general slow/blocked
>>> >> message:
>>> >>
>>> >> 2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
>>> >> [WRN] : Health check failed: 116 slow requests are blocked > 32 sec
>>> >> (REQUEST_SLOW)
>>> >> 2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
>>> >> [WRN] : Health check update: 66 slow requests are blocked > 32 sec
>>> >> (REQUEST_SLOW)
>>> >> 2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
>>> >> [WRN] : Health check update: 5 slow requests are blocked > 32 sec
>>> >> (REQUEST_SLOW)
>>> >> 2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
>>> >> [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
>>> >> blocked > 32 sec)
>>> >> 2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
>>> >> [INF] : Cluster is now healthy
>>> >>
>>> >> So where are the details?
>>> >
>>> > Working on this, thanks.
>>> >
>>> > See https://tracker.ceph.com/issues/23236
>>>
>>> As noted in the tracker, but I think this would be useful to others:
>>>
>>> If something like the below comes up, how do you troubleshoot the
>>> cause of past events, especially when the issue is specific to just a
>>> handful out of 1000s of OSDs on many hosts?
>>>
>>> 2018-03-11 22:00:00.000132 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
>>> 2018-03-11 22:44:46.173825 mon.roc-vm-sc3c234 [WRN] Health check
>>> failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>> 2018-03-11 22:44:52.245738 mon.roc-vm-sc3c234 [WRN] Health check
>>> update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>> 2018-03-11 22:44:57.925686 mon.roc-vm-sc3c234 [WRN] Health check
>>> update: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>> 2018-03-11 22:45:02.926031 mon.roc-vm-sc3c234 [WRN] Health check
>>> update: 14 slow requests are blocked > 32 sec (REQUEST_SLOW)
>>> 2018-03-11 22:45:06.413741 mon.roc-vm-sc3c234 [INF] Health check
>>> cleared: REQUEST_SLOW (was: 14 slow requests are blocked > 32 sec)
>>> 2018-03-11 22:45:06.413814 mon.roc-vm-sc3c234 [INF] Cluster is now healthy
>>> 2018-03-11 23:00:00.000136 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
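>>>
>>> The closest per-OSD signal I can think of is latency from `ceph osd
>>> perf`, but that only shows the present, not past events. The output
>>> below is illustrative:
>>>
>>> # ceph osd perf
>>> osd commit_latency(ms) apply_latency(ms)
>>>   2                 14                14
>>>   1                  0                 0
>>>   0                  1                 1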
>>>
>>> Thank you,
>>> Alex
>
> I am in this situation now, just seeing the slow requests here and
> there, with no hint or clue as to which OSD this is on.  OSD logs do
> not show anything indicative of a problem.  The network is fine, the
> hosts are fine; how does one troubleshoot this?
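One thing that may help narrow it down is polling each OSD's recently
completed ops over the admin socket. A rough sketch, assuming the
default socket paths under /var/run/ceph (each historic op carries a
description and a duration):

# run on each OSD host; sockets live under /var/run/ceph by default
for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== ${sock} =="
    ceph daemon "${sock}" dump_historic_ops | grep -e duration -e description
done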

Another one today:
ceph-mon.roc-vm-sc3c234.log:2018-03-15 23:10:28.381685 7fa0e9ee2700  0
log_channel(cluster) log [WRN] : Health check failed: 1 slow requests
are blocked > 32 sec (REQUEST_SLOW)
ceph-mon.roc-vm-sc3c234.log:2018-03-15 23:10:32.412299 7fa0e9ee2700  0
log_channel(cluster) log [INF] : Health check cleared: REQUEST_SLOW
(was: 1 slow requests are blocked > 32 sec)

All OSDs report zero blocked requests.
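
Checked with something along these lines (a sketch; the socket paths
assume the defaults, and num_blocked_ops is the counter reported in the
dump_blocked_ops output):

# run on each OSD host
for sock in /var/run/ceph/ceph-osd.*.asok; do
    printf '%s ' "${sock}"
    ceph daemon "${sock}" dump_blocked_ops | grep num_blocked_ops
done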
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



