Re: Please help me get rid of Slow / blocked requests

Hi Paul,

Thanks for replying to my query.
I am not sure the benchmark is overloading the cluster: in 3 out of
5 runs it sits at around 37K IOPS, but in the problematic runs it
suddenly drops to 0 IOPS for a couple of minutes and then resumes.
This is a test cluster, so nothing else is running on it.
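
To pin down exactly when the stall happens, on the next run I plan to
log cluster state once a second alongside the benchmark. Just a rough
sketch (the log path is arbitrary):

while true; do
  date '+%T'           # timestamp each sample
  ceph health detail   # which OSDs currently report blocked/slow requests
  ceph osd perf        # per-OSD commit/apply latency in ms
  sleep 1
done > /tmp/bench-watch.log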

OSD 2 is the same hardware as all the other OSDs, and it is a
different OSD every time, on both nodes.
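
In case it helps, this is what I intend to capture the next time an
OSD reports blocked requests (osd.2 is only an example here, since the
slow OSD changes every run; the daemon commands have to be run on the
node hosting that OSD):

ceph osd perf                         # compare commit/apply latency across all 20 OSDs
ceph daemon osd.2 dump_ops_in_flight  # ops currently in flight on the affected OSD
ceph daemon osd.2 dump_historic_ops   # recent slowest completed ops with per-event timings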

Any pointers?

Regards,
Shantur

On Mon, Apr 30, 2018 at 6:34 PM, Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
> Hi,
>
> blocked requests are just requests that took longer than 30 seconds to
> complete; this just means your cluster is completely overloaded by the
> benchmark.
> Also, OSD 2 might be slower than your other OSDs.
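>
> A quick way to check that, for example:
>
> ceph osd perf   # per-OSD commit/apply latency in ms
>
> An OSD that consistently stands out there is worth a closer look at
> the disk level.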
>
> Paul
>
> 2018-04-30 15:36 GMT+02:00 Shantur Rathore <shantur.rathore@xxxxxxxxx>:
>>
>> Hi all,
>>
>> I am trying to get my first test Ceph cluster working.
>>
>> CentOS 7 with ELRepo 4.16.3-1.el7.elrepo.x86_64 kernel (for iSCSI HA)
>> Configured using ceph-ansible
>> 3 mons (2 of them on the OSD nodes)
>> 2 OSD nodes
>> 20 OSDs (10 per node)
>>
>> Each OSD node has 256GB of memory and a 2x 10GbE bonded interface.
>> For simplicity it uses the public network only.
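>>
>> For reference, the network part of ceph.conf looks roughly like the
>> following (the /24 mask is my guess, inferred from the addresses that
>> show up in the logs below):
>>
>> [global]
>> public_network = 10.187.21.0/24
>> # no cluster_network set, so replication traffic shares the same bond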
>>
>> During testing of the cluster from one of the OSD nodes, whenever I
>> run a test I see slow / blocked requests on both nodes, which clear up
>> after some time.
>>
>> I have checked the disks and the network, and both are working as
>> expected. I have been reading up and trying to work out what the issue
>> could be, but so far I have not found a fix or solution.
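>>
>> For what it is worth, the checks I ran were along these lines (host
>> and device names are just examples):
>>
>> iperf3 -s                   # on storage-30
>> iperf3 -c storage-30 -P 4   # from storage-29, to load the 10GbE bond
>> iostat -x 1                 # per-disk utilisation / await while the benchmark runs
>> smartctl -a /dev/sdb        # SMART health of an individual OSD disk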
>>
>> #Test Command
>> [root@storage-29 ~]# rbd bench --io-type write -p test --image disk1
>> --io-pattern seq --io-size 4K --io-total 10G
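>>
>> I also plan to repeat the same run with a random pattern, to see
>> whether the stall is specific to sequential writes:
>>
>> [root@storage-29 ~]# rbd bench --io-type write -p test --image disk1
>> --io-pattern rand --io-size 4K --io-total 10G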
>>
>> In the sequential run I saw in "ceph health detail" that osd.2 had
>> blocked requests, so I ran:
>>
>> [root@storage-29 ~]# ceph daemon osd.2 dump_blocked_ops
>> ... last op from the output:
>>
>> {
>>             "description": "osd_op(client.181675.0:933 6.e1
>> 6:8736f1d3:::rbd_data.20d9674b0dc51.00000000000006b7:head [write
>> 1150976~4096] snapc 0=[] ondisk+write+known_if_redirected e434)",
>>             "initiated_at": "2018-04-30 14:04:37.656717",
>>             "age": 79.228713,
>>             "duration": 79.230355,
>>             "type_data": {
>>                 "flag_point": "waiting for sub ops",
>>                 "client_info": {
>>                     "client": "client.181675",
>>                     "client_addr": "10.187.21.212:0/342865484",
>>                     "tid": 933
>>                 },
>>                 "events": [
>>                     {
>>                         "time": "2018-04-30 14:04:37.656717",
>>                         "event": "initiated"
>>                     },
>>                     {
>>                         "time": "2018-04-30 14:04:37.656789",
>>                         "event": "queued_for_pg"
>>                     },
>>                     {
>>                         "time": "2018-04-30 14:04:37.656869",
>>                         "event": "reached_pg"
>>                     },
>>                     {
>>                         "time": "2018-04-30 14:04:37.656917",
>>                         "event": "started"
>>                     },
>>                     {
>>                         "time": "2018-04-30 14:04:37.656970",
>>                         "event": "waiting for subops from 10"
>>                     },
>>                     {
>>                         "time": "2018-04-30 14:04:37.669473",
>>                         "event": "op_commit"
>>                     },
>>                     {
>>                         "time": "2018-04-30 14:04:37.669475",
>>                         "event": "op_applied"
>>                     }
>>                 ]
>>             }
>>         }
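>>
>> Since the op is stuck at "waiting for sub ops" for the replica on
>> osd.10, next time I will also query that OSD directly (osd.10 lives
>> on storage-30):
>>
>> [root@storage-30 ~]# ceph daemon osd.10 dump_ops_in_flight   # sub-ops in flight on the replica
>> [root@storage-30 ~]# ceph daemon osd.10 dump_historic_ops    # recent slowest ops with timings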
>>
>> I checked the logs on the other node:
>>
>> [root@storage-30 ~]# tail -n 1000 /var/log/ceph/ceph-osd.10.log
>>
>> Around that time (~14:04) nothing is printed in the log:
>>
>> 2018-04-30 13:34:59.986731 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.211:6818/344034
>> conn(0x55b79db85000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 56 vs existing
>> csq=55 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:00.992309 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560
>> conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 9 vs existing
>> csq=9 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:00.992711 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560
>> conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing
>> csq=9 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:01.328882 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497
>> conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 7 vs existing
>> csq=7 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:01.329066 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497
>> conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8 vs existing
>> csq=7 existing_state=STATE_STANDBY
>> 2018-04-30 14:16:30.622506 7fa906bdc700  0 log_channel(cluster) log
>> [DBG] : 6.396 scrub starts
>> 2018-04-30 14:16:30.640450 7fa906bdc700  0 log_channel(cluster) log
>> [DBG] : 6.396 scrub ok
>> 2018-04-30 14:19:32.000798 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6801/93293
>> conn(0x55b79e106000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing
>> csq=9 existing_state=STATE_OPEN
>> 2018-04-30 14:19:32.030075 7fa91d3c5700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560
>> conn(0x55b7a120b800 :-1 s=STATE_OPEN pgs=157 cs=11 l=0).fault
>> initiating reconnect
>> 2018-04-30 14:19:58.492777 7fa91c3c3700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.212:6829/94852
>> conn(0x55b79e4a1000 :-1 s=STATE_OPEN pgs=153 cs=9 l=0).fault
>> initiating reconnect
>> 2018-04-30 14:19:59.265081 7fa91c3c3700  0 --
>> 10.187.21.211:6810/343380 >> 10.187.21.211:6838/345694
>> conn(0x55b79dba1000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH
>> pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 67 vs existing
>> csq=67 existing_state=STATE_STANDBY
>>
>>
>> I am not sure what I am missing:
>> - Maybe the request packet got lost?
>> - Maybe I haven't enabled proper logging for the OSDs? (see the sketch below)
>>
>> Please suggest how I should go about debugging this problem.
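>>
>> For the logging point, this is roughly what I had in mind before the
>> next run (levels picked arbitrarily, and they generate a lot of
>> output):
>>
>> [root@storage-29 ~]# ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'
>> # ... run the benchmark, reproduce the stall, then turn it back down:
>> [root@storage-29 ~]# ceph tell osd.* injectargs '--debug_osd 1/5 --debug_ms 0/5'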
>>
>> Regards,
>> Shantur
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@xxxxxxxxxxxxxx
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



