Fwd: Slow requests troubleshooting in Luminous - details missing

I was following this conversation on the tracker and have the same
question. I ran into a situation with slow requests and had no idea how
to find the reason. I eventually found it, but only because I knew I had
upgraded the Mellanox drivers on one host and decided to check the IB
config (and the root cause was there: the adapter had switched into
datagram mode). Had that not been the reason, I would have been truly
lost.
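For anyone in the same spot: on IPoIB setups the mode is quick to check
from sysfs. A minimal sketch, assuming the interface is named ib0;
adjust for your setup:

  # prints "datagram" or "connected" for an IPoIB interface
  cat /sys/class/net/ib0/mode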

On 12 Mar 2018 at 9:39, "Alex Gorbachev" <ag@xxxxxxxxxxxxxxxxxxx> wrote:

On Mon, Mar 5, 2018 at 11:20 PM, Brad Hubbard <bhubbard@xxxxxxxxxx> wrote:
> On Fri, Mar 2, 2018 at 3:54 PM, Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>> On Thu, Mar 1, 2018 at 10:57 PM, David Turner <drakonstein@xxxxxxxxx> wrote:
>>> Blocked requests and slow requests are synonyms in Ceph.  They are two
>>> names for the exact same thing.
>>>
>>>
>>> On Thu, Mar 1, 2018, 10:21 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> On Thu, Mar 1, 2018 at 2:47 PM, David Turner <drakonstein@xxxxxxxxx>
>>>> wrote:
>>>> > `ceph health detail` should show you more information about the slow
>>>> > requests.  If the output is too much stuff, you can grep out for blocked
>>>> > or
>>>> > something.  It should tell you which OSDs are involved, how long they've
>>>> > been slow, etc.  The default is for them to show '> 32 sec' but that may
>>>> > very well be much longer and `ceph health detail` will show that.
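For example, filtering the detailed health output down to the relevant
lines might look like this (a sketch; REQUEST_SLOW is the health check
name that appears in the Luminous logs below, and if I recall correctly
the ~30 sec complaint threshold itself is governed by the
osd_op_complaint_time option):

  # keep only the slow/blocked request lines
  ceph health detail | grep -Ei 'REQUEST_SLOW|slow|blocked'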
>>>>
>>>> Hi David,
>>>>
>>>> Thank you for the reply.  Unfortunately, the health detail only shows
>>>> the blocked requests, not the OSDs involved.  This seems to be related
>>>> to a compression setting on the pool; there is nothing in the OSD logs.
>>>>
>>>> I replied to another compression thread.  This would make sense, since
>>>> compression is new; in the past, all such issues were reflected in the
>>>> OSD logs and related to either the network or OSD hardware.
>>>>
>>>> Regards,
>>>> Alex
>>>>
>>>> >
>>>> > On Thu, Mar 1, 2018 at 2:23 PM Alex Gorbachev <ag@xxxxxxxxxxxxxxxxxxx>
>>>> > wrote:
>>>> >>
>>>> >> Is there a switch to turn on the display of specific OSD issues?  Or
>>>> >> does the log below indicate a generic problem, e.g. the network,
>>>> >> rather than any specific OSD?
>>>> >>
>>>> >> 2018-02-28 18:09:36.438300 7f6dead56700  0
>>>> >> mon.roc-vm-sc3c234@0(leader).data_health(46) update_stats avail 56%
>>>> >> total 15997 MB, used 6154 MB, avail 9008 MB
>>>> >> 2018-02-28 18:09:41.477216 7f6dead56700  0 log_channel(cluster) log
>>>> >> [WRN] : Health check failed: 73 slow requests are blocked > 32 sec
>>>> >> (REQUEST_SLOW)
>>>> >> 2018-02-28 18:09:47.552669 7f6dead56700  0 log_channel(cluster) log
>>>> >> [WRN] : Health check update: 74 slow requests are blocked > 32 sec
>>>> >> (REQUEST_SLOW)
>>>> >> 2018-02-28 18:09:53.794882 7f6de8551700  0
>>>> >> mon.roc-vm-sc3c234@0(leader) e1 handle_command mon_command({"prefix":
>>>> >> "status", "format": "json"} v 0) v1
>>>> >>
>>>> >> --
>>
>> I was wrong about pool compression mattering: an uncompressed pool
>> also generates these slow messages.
>>
>> The question is why there is no subsequent message pointing to specific
>> OSDs, as there was in Jewel and prior, e.g. this example from RH:
>>
>> 2015-08-24 13:18:10.024659 osd.1 127.0.0.1:6812/3032 9 : cluster [WRN]
>> 6 slow requests, 6 included below; oldest blocked for > 61.758455 secs
>>
>> 2016-07-25 03:44:06.510583 osd.50 [WRN] slow request 30.005692 seconds
>> old, received at {date-time}: osd_op(client.4240.0:8
>> benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4
>> currently waiting for subops from [610]
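When a line like that does name the replica it is waiting on, the OSD
can be mapped back to a host with something like this (a sketch; 610 is
just the OSD id from the example above):

  # print the ip, host and CRUSH location of the given OSD
  ceph osd find 610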
>>
>> In comparison, my Luminous cluster only shows the general slow/blocked message:
>>
>> 2018-03-01 21:52:54.237270 7f7e419e3700  0 log_channel(cluster) log
>> [WRN] : Health check failed: 116 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>> 2018-03-01 21:53:00.282721 7f7e419e3700  0 log_channel(cluster) log
>> [WRN] : Health check update: 66 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>> 2018-03-01 21:53:08.534244 7f7e419e3700  0 log_channel(cluster) log
>> [WRN] : Health check update: 5 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>> 2018-03-01 21:53:10.382510 7f7e419e3700  0 log_channel(cluster) log
>> [INF] : Health check cleared: REQUEST_SLOW (was: 5 slow requests are
>> blocked > 32 sec)
>> 2018-03-01 21:53:10.382546 7f7e419e3700  0 log_channel(cluster) log
>> [INF] : Cluster is now healthy
>>
>> So where are the details?
>
> Working on this, thanks.
>
> See https://tracker.ceph.com/issues/23236

Repeating my question from the tracker, as I think it would be useful to others:

If something like the log below comes up, how do you troubleshoot the
cause of past events, especially when the problem is specific to just a
handful out of 1000s of OSDs on many hosts?

2018-03-11 22:00:00.000132 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
2018-03-11 22:44:46.173825 mon.roc-vm-sc3c234 [WRN] Health check
failed: 12 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:52.245738 mon.roc-vm-sc3c234 [WRN] Health check
update: 9 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:44:57.925686 mon.roc-vm-sc3c234 [WRN] Health check
update: 10 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:02.926031 mon.roc-vm-sc3c234 [WRN] Health check
update: 14 slow requests are blocked > 32 sec (REQUEST_SLOW)
2018-03-11 22:45:06.413741 mon.roc-vm-sc3c234 [INF] Health check
cleared: REQUEST_SLOW (was: 14 slow requests are blocked > 32 sec)
2018-03-11 22:45:06.413814 mon.roc-vm-sc3c234 [INF] Cluster is now healthy
2018-03-11 23:00:00.000136 mon.roc-vm-sc3c234 [INF] overall HEALTH_OK
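One avenue that seems worth trying for after-the-fact analysis is the
OSD admin socket; a sketch, assuming shell access to the host carrying
the OSD, with osd.12 as a placeholder id (note that historic ops are
kept in a small in-memory window, so they age out quickly):

  # ops currently in flight on this OSD
  ceph daemon osd.12 dump_ops_in_flight
  # recently completed ops, with per-stage timestamps
  ceph daemon osd.12 dump_historic_ops

But that still requires knowing which OSD to query, which is the crux of
the question above.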

Thank you,
Alex


>
>>
>> Thanks,
>> Alex
>
> --
> Cheers,
> Brad



--

Best regards,
Vladimir Drobyshevskiy
"АйТи Город" (IT Gorod) company
+7 343 2222192

IT consulting
Turnkey project delivery
IT services outsourcing
IT infrastructure outsourcing
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
