Re: Slow requests troubleshooting in Luminous - details missing

On Fri, Mar 2, 2018 at 4:17 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:

On 2018-03-02 07:54, Alex Gorbachev wrote:

On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
Blocked requests and slow requests are synonyms in Ceph. They are two
names for the exact same thing.


On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:

On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonstein@xxxxxxxxx>
wrote:
`ceph health detail` should show you more information about the slow
requests.  If the output is too much, you can grep it for 'blocked'
or similar.  It should tell you which OSDs are involved, how long
they've been slow, etc.  The default is for them to show '> 32 sec',
but that may very well be much longer, and `ceph health detail` will
show that.
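
For example (just a sketch; the exact wording of the output varies a
bit between releases, so adjust the grep pattern to what you see):

  ceph health detail                    # full health report, including the REQUEST_SLOW section
  ceph health detail | grep -i blocked  # keep only the lines about blocked/slow requests
  ceph health detail | grep -i 'osd\.'  # look for any per-OSD lines, if they are present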

Hi David,

Thank you for the reply.  Unfortunately, the health detail only shows
blocked requests.  This seems to be related to a compression setting
on the pool, with nothing in the OSD logs.

I replied in another compression thread.  This makes sense, since
compression is new; in the past, all such issues showed up in the
OSD logs and were related to either the network or OSD hardware.

Regards,
Alex


On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
wrote:

Is there a switch to turn on the display of specific OSD issues?  Or
does the below indicate a generic problem, e.g. the network, rather
than any specific OSD?

2018-02-28 18:09:36.438300 7f6dead56700  0
mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
total 15997 MB, used 6154 MB, avail 9008 MB
2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
[WRN] : Health check failed: 73 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
[WRN] : Health check update: 74 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-02-28 18:09:53.794882 7f6de8551700  0
mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
"status", "format": "json"} v 0) v1

--

I was wrong about the pool compression mattering: an uncompressed
pool also generates these slow messages.

The question is why there is no subsequent message pointing to
specific OSDs, as in Jewel and prior, e.g. this example from RH:

2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
6 slow requests, 6 included below; oldest blocked for > 61.758455 secs

2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
old, received at {date-time}: osd_op(client.4240.0:8
benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
currently waiting for subops from [610]

In comparison, my Luminous cluster only shows the general slow/blocked message:

2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check failed: 116 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 66 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
[WRN] : Health check update: 5 slow requests are blocked > 32 sec
(REQUEST_SLOW)
2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
[INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
blocked > 32 sec)
2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
[INF] : Cluster is now healthy

So where are the details?
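
One place I can think of to dig for per-op detail is the OSD admin
socket.  This is just a sketch, assuming shell access to the OSD host
and that osd.12 is one of the suspects (a made-up example):

  ceph daemon osd.12 dump_ops_in_flight   # ops currently in flight, with their age and current state
  ceph daemon osd.12 dump_historic_ops    # recently completed slow ops, with per-step timestamps

But that still means guessing which OSD to look at.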

Thanks,
Alex
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


 
 
Hi Alex,

I agree, the health detail should display osd info.
Regarding the slow requests: ESX is very sensitive to I/O delay; this could be due to cluster overload or a disk going bad. If you have a monitoring tool, do you see high load spikes (such as disk % busy) on your disks? Generally it is good, if you can, to simulate the maximum client workload plus concurrent recovery and scrubbing, and watch your resources to see whether they get overloaded. Most if not all issues with ESX will go away if you have enough hardware.

I suspect this is what you are seeing; the VMotion and compression could be the extra load that triggered the issue, in which case the problem is not specific to a particular OSD. The other possibility is that a particular disk is going bad, which the monitoring tool should be able to identify.
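
As a rough illustration of the kind of monitoring I mean (assuming sysstat is installed on the OSD hosts; these are just starting points):

  ceph osd perf    # per-OSD commit/apply latency; a slow or dying disk usually stands out here
  iostat -x 5      # per-device %util and await on each OSD host, to spot an overloaded disk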
 
Note that VMotion generates an I/O pattern that keeps hitting the same OSD with concurrent small writes; if you can use rbd striping, creating 64k stripes may help. A controller with a write-back cache will also help.
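
A minimal sketch of what I mean by 64k stripes (the pool and image names are placeholders, and --stripe-count 16 is just an example value to tune):

  # stripe writes in 64 KB units across 16 objects at a time
  rbd create rbd/esx-vmotion-test --size 100G --stripe-unit 65536 --stripe-count 16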

Maged

Thank you, Maged. This does seem to be related to compression. I did two things: turned off compression on the pool and tried vmotion to the same image, which failed.

Then I made another image and that time it worked. So maybe some compressed parts of the old image introduced slowness. 

To be fair, I also set all Arecas to a 75% buffer, so I will be testing more today to see whether that is the cause.

All OSDs look healthy. Will update... but yes, it is odd that no OSD detail is displayed with slow requests. May need a tracker for that.

Best regards,
Alex
--
Alex Gorbachev
Storcium
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
