Hi,
Blocked requests are simply requests that took longer than 30 seconds to complete; it just means your cluster is completely overloaded by the benchmark.
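(That 30-second threshold is the osd_op_complaint_time option, 30 by default; you can verify what an OSD actually uses with e.g. "ceph daemon osd.2 config get osd_op_complaint_time" on the node hosting it.)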
Also, OSD 2 might be slower than your other OSDs.
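A quick way to check whether it stands out is to compare per-OSD latencies and raw throughput, e.g.:

ceph osd perf
ceph tell osd.2 bench

("ceph osd perf" shows the commit/apply latency of every OSD; the bench writes a chunk of test data to just that OSD, so run it while the cluster is otherwise idle and compare the result against your other OSDs.)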
2018-04-30 15:36 GMT+02:00 Shantur Rathore <shantur.rathore@xxxxxxxxx>:
Hi all,
I am trying to get my first test Ceph cluster working.
CentOS 7 with ELRepo 4.16.3-1.el7.elrepo.x86_64 kernel (for iSCSI HA)
Configured using ceph-ansible
3 mons (2 of them colocated on the OSD nodes)
2 OSD nodes
20 OSDs (10 per node)
Each OSD node has 256 GB of memory and a bonded 2x10GbE interface.
For simplicity it uses the public network only.
Whenever I run a test against the cluster from one of the OSD nodes, I
see slow/blocked requests on both nodes, which clear up after some
time.
I have checked the disks and the network, and both are working as expected.
I have been reading up and trying to find out what the issue could be,
but I am unable to find any fix or solution to the problem.
# Test command
[root@storage-29 ~]# rbd bench --io-type write -p test --image disk1 --io-pattern seq --io-size 4K --io-total 10G
In this run I saw in "ceph health detail" that osd.2 had blocked
requests, so I ran:
[root@storage-29 ~]# ceph daemon osd.2 dump_blocked_ops
... The last op from the output:
{
    "description": "osd_op(client.181675.0:933 6.e1 6:8736f1d3:::rbd_data.20d9674b0dc51.00000000000006b7:head [write 1150976~4096] snapc 0=[] ondisk+write+known_if_redirected e434)",
    "initiated_at": "2018-04-30 14:04:37.656717",
    "age": 79.228713,
    "duration": 79.230355,
    "type_data": {
        "flag_point": "waiting for sub ops",
        "client_info": {
            "client": "client.181675",
            "client_addr": "10.187.21.212:0/342865484",
            "tid": 933
        },
        "events": [
            {
                "time": "2018-04-30 14:04:37.656717",
                "event": "initiated"
            },
            {
                "time": "2018-04-30 14:04:37.656789",
                "event": "queued_for_pg"
            },
            {
                "time": "2018-04-30 14:04:37.656869",
                "event": "reached_pg"
            },
            {
                "time": "2018-04-30 14:04:37.656917",
                "event": "started"
            },
            {
                "time": "2018-04-30 14:04:37.656970",
                "event": "waiting for subops from 10"
            },
            {
                "time": "2018-04-30 14:04:37.669473",
                "event": "op_commit"
            },
            {
                "time": "2018-04-30 14:04:37.669475",
                "event": "op_applied"
            }
        ]
    }
}
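If I read this correctly, the op is stuck at "waiting for subops from
10", i.e. the primary has already committed locally and is waiting for
the replica osd.10 to acknowledge the write, so osd.10 seems to be the
one to look at. (I assume "ceph daemon osd.10 dump_historic_ops" on
its node would also show its slowest recent ops, but I haven't tried
that yet.)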
I checked the logs from
[root@storage-30 ~]# tail -n 1000 /var/log/ceph/ceph-osd.10.log
Around the time of the blocked op (14:04) nothing is printed at all; the nearest entries are:
2018-04-30 13:34:59.986731 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.211:6818/344034 conn(0x55b79db85000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 56 vs existing csq=55 existing_state=STATE_STANDBY
2018-04-30 13:35:00.992309 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560 conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 9 vs existing csq=9 existing_state=STATE_STANDBY
2018-04-30 13:35:00.992711 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560 conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing csq=9 existing_state=STATE_STANDBY
2018-04-30 13:35:01.328882 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497 conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 7 vs existing csq=7 existing_state=STATE_STANDBY
2018-04-30 13:35:01.329066 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497 conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8 vs existing csq=7 existing_state=STATE_STANDBY
2018-04-30 14:16:30.622506 7fa906bdc700 0 log_channel(cluster) log [DBG] : 6.396 scrub starts
2018-04-30 14:16:30.640450 7fa906bdc700 0 log_channel(cluster) log [DBG] : 6.396 scrub ok
2018-04-30 14:19:32.000798 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6801/93293 conn(0x55b79e106000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing csq=9 existing_state=STATE_OPEN
2018-04-30 14:19:32.030075 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560 conn(0x55b7a120b800 :-1 s=STATE_OPEN pgs=157 cs=11 l=0).fault initiating reconnect
2018-04-30 14:19:58.492777 7fa91c3c3700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6829/94852 conn(0x55b79e4a1000 :-1 s=STATE_OPEN pgs=153 cs=9 l=0).fault initiating reconnect
2018-04-30 14:19:59.265081 7fa91c3c3700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.211:6838/345694 conn(0x55b79dba1000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 67 vs existing csq=67 existing_state=STATE_STANDBY
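(The only entries that look suspicious to me are the "fault initiating reconnect" lines between the two nodes around 14:19, but those are well after the blocked op at 14:04.)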
I am not sure what I am missing:
- Maybe the request packet got lost.
- Maybe I haven't enabled proper logging for the OSDs.
Please suggest how I can go about debugging this problem.
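(On the logging point: I assume I could raise the verbosity with something like "ceph tell osd.* injectargs '--debug-osd 10 --debug-ms 1'" and rerun the benchmark, but I am not sure those are the right subsystems/levels to look at.)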
Regards,
Shantur
--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com