On Fri, Mar 2, 2018 at 9:56 AM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Mar 2, 2018 at 4:17 AM Maged Mokhtar <mmokhtar@xxxxxxxxxxx> wrote:
>>
>> On 2018-03-02 07:54, Alex Gorbachev wrote:
>>
>> On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>>
>> Blocked requests and slow requests are synonyms in Ceph. They are two names
>> for the exact same thing.
>>
>> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>>
>> `ceph health detail` should show you more information about the slow
>> requests. If the output is too much stuff, you can grep out for blocked or
>> something. It should tell you which OSDs are involved, how long they've
>> been slow, etc. The default is for them to show '> 32 sec', but that may
>> very well be much longer, and `ceph health detail` will show that.
>>
>> Hi David,
>>
>> Thank you for the reply. Unfortunately, the health detail only shows
>> blocked requests. This seems to be related to a compression setting on
>> the pool; there is nothing in the OSD logs.
>>
>> I replied to another compression thread. This makes sense, since
>> compression is new, and in the past all such issues were reflected in
>> the OSD logs and were related to either the network or OSD hardware.
>>
>> Regards,
>> Alex
>>
>> On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>
>> Is there a switch to turn on the display of specific OSD issues? Or does
>> the below indicate a generic problem, e.g. the network, rather than any
>> specific OSD?
>>
>> 2018-02-28 18:09:36.438300 7f6dead56700 0 mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56% total 15997 MB, used 6154 MB, avail 9008 MB
>> 2018-02-28 18:09:41.477216 7f6dead56700 0 log_channel(cluster) log [WRN] : Health check failed: 73 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-02-28 18:09:47.552669 7f6dead56700 0 log_channel(cluster) log [WRN] : Health check update: 74 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-02-28 18:09:53.794882 7f6de8551700 0 mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix": "status", "format": "json"} v 0) v1
>>
>> --
>>
>> I was wrong: the pool compression does not matter; even an uncompressed
>> pool also generates these slow messages.
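As an aside, picking up David's point above about which OSDs are involved: even when the monitor log only shows the generic REQUEST_SLOW counts, the per-OSD op detail can usually still be pulled from the OSD admin sockets. A rough sketch; the OSD id below is just a placeholder, and the ceph daemon commands have to be run on the node hosting that OSD:

  ceph osd perf                           # per-OSD commit/apply latency, to spot a slow disk
  ceph daemon osd.12 dump_blocked_ops     # ops currently blocked on this OSD
  ceph daemon osd.12 dump_historic_ops    # recently completed slow ops with per-event timings

The dump_historic_ops output includes the same kind of "waiting for subops" events that used to show up in the OSD logs.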
>>
>> The question is why there is no subsequent message relating to specific
>> OSDs, as in Jewel and prior, like this example from RH:
>>
>> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN] 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>>
>> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
>>
>> In comparison, my Luminous cluster only shows the general slow/blocked
>> message:
>>
>> 2018-03-01 21:52:54.237270 7f7e419e3700 0 log_channel(cluster) log [WRN] : Health check failed: 116 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-03-01 21:53:00.282721 7f7e419e3700 0 log_channel(cluster) log [WRN] : Health check update: 66 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-03-01 21:53:08.534244 7f7e419e3700 0 log_channel(cluster) log [WRN] : Health check update: 5 slow requests are blocked > 32 sec (REQUEST_SLOW)
>> 2018-03-01 21:53:10.382510 7f7e419e3700 0 log_channel(cluster) log [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are blocked > 32 sec)
>> 2018-03-01 21:53:10.382546 7f7e419e3700 0 log_channel(cluster) log [INF] : Cluster is now healthy
>>
>> So where are the details?
>>
>> Thanks,
>> Alex
>>
>> Hi Alex,
>>
>> I agree, the health detail should display OSD info.
>> Regarding the slow requests: ESX is very sensitive to I/O delay, so this
>> could be due to cluster overload or a disk going bad. If you have a
>> monitoring tool, do you see high load spikes (such as disk % busy) on your
>> disks? Generally it is good, if you can, to simulate the maximum client
>> workload plus concurrent recovery and scrubbing, and monitor your resources
>> to see if they get overloaded. Most if not all issues with ESX will go away
>> if you have enough hardware. I suspect this is what you are seeing,
>> especially as the vMotion and compression could be the extra load which
>> triggered the issue; in that case the problem is not specific to a
>> particular OSD. The other possibility is that a particular disk is going
>> bad, which the monitoring tool should be able to identify.
>>
>> Note that vMotion generates an I/O pattern that keeps hitting the same OSD
>> with concurrent small writes; if you can use RBD striping, creating 64k
>> stripes may help. A controller with a write-back cache will also help.
>>
>> Maged
>
> Thank you, Maged. This does seem to be related to compression. I did two
> things: turned off compression on the pool and tried vMotion to the same
> image, which failed.
>
> Then I made another image, and that time it worked. So maybe some
> compressed parts of the old image introduced the slowness.
>
> To be fair, I also set all Arecas to a 75% buffer, so I will be testing
> more today to see whether that is the cause.
>
> All OSDs look healthy. Will update... but yes, it is odd that no OSD detail
> is displayed with slow requests. May need a tracker for that.

Further investigation: when the journaling and exclusive-lock features are
set on an image, that is when the slowdowns begin. Unsetting these features
makes everything work fine.
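For anyone who wants to check or toggle the same image features, it can be done roughly like this with the rbd CLI (the pool/image name below is just a placeholder; journaling has to be disabled before exclusive-lock, since it depends on it):

  rbd info rbd/esx-lun-01                            # shows which features are enabled
  rbd feature disable rbd/esx-lun-01 journaling
  rbd feature disable rbd/esx-lun-01 exclusive-lock

If object-map and fast-diff are enabled, they also depend on exclusive-lock and would need to be disabled first (fast-diff, then object-map).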
I opened the tracker at https://tracker.ceph.com/issues/23205
>
> Best regards,
> Alex
>
> --
> --
> Alex Gorbachev
> Storcium

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com