On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
Blocked requests and slow requests are synonyms in Ceph. They are two names
for the exact same thing.
On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
`ceph health detail` should show you more information about the slow
requests. If the output is too much, you can grep for "blocked" or
similar. It should tell you which OSDs are involved, how long they've
been slow, etc. The default is for them to show '> 32 sec', but that may
very well be much longer, and `ceph health detail` will show that.
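For example, something along these lines ("blocked" is just the pattern
I usually grep for, adjust to taste):

ceph health detail
ceph health detail | grep -i blocked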
Hi David,
Thank you for the reply. Unfortunately, `ceph health detail` only shows
the blocked requests themselves, without details on specific OSDs. This
seems to be related to a compression setting on the pool; there is
nothing in the OSD logs.
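For reference, I am checking the pool's compression settings like this
(with the real pool name in place of <pool>):

ceph osd pool get <pool> compression_mode
ceph osd pool get <pool> compression_algorithm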
I replied to another compression thread as well. This would make sense,
since compression is new; in the past, all such issues showed up in the
OSD logs and were related to either the network or OSD hardware.
Regards,
Alex
On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
Is there a switch to turn on the display of specific OSD issues? Or
does the output below indicate a generic problem, e.g. the network,
rather than any specific OSD?
2018-02-28 18:09:36.438300 7f6dead56700 0
mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
total 15997 MB, used 6154 MB, avail 9008 MB
2018-02-28 18:09:41.477216 7f6dead56700 0 log_channel(cluster) log
[WRN] : Health check failed: 73 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:47.552669 7f6dead56700 0 log_channel(cluster) log
[WRN] : Health check update: 74 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:53.794882 7f6de8551700 0
mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
"status", "format": "json"} v 0) v1
--
I was wrong about the pool compression mattering: an uncompressed pool
also generates these slow request messages.
The question is why there is no subsequent message identifying the
specific OSDs, as there was in Jewel and prior releases, e.g. this
example from Red Hat:
2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
old, received at {date-time}: osd_op(client.4240.0:8
benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
currently waiting for subops from [610]
In comparison, my Luminous cluster only shows the general slow/blocked message:
2018-03-01 21:52:54.237270 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700 0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700 0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700 0 log_channel(cluster) log
[INF] : Cluster is now healthy
So where are the details?
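In the meantime, the only way I see to dig further is to query the OSD
admin sockets directly on each OSD host, along the lines of (with the
actual OSD id in place of <id>):

ceph daemon osd.<id> dump_ops_in_flight
ceph daemon osd.<id> dump_historic_ops

and to watch per-OSD latencies with:

ceph osd perf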
Thanks,
Alex
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com