Hi Paul,

Thanks for replying to my query. I am not sure the benchmark is overloading the cluster: in 3 out of 5 runs the benchmark sustains around 37K IOPS, but in the problematic runs it suddenly drops to 0 IOPS for a couple of minutes and then resumes. This is a test cluster, so nothing else is running on it. OSD 2 is the same as all the other OSDs, and the blocked OSD is a different one every time, on both nodes.

Any pointers?

Regards,
Shantur
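A rough way to narrow down a stall like this (a sketch only; it assumes the standard OSD admin socket and sysstat tools are available, and osd.2 is just the OSD named in the run quoted below) is to watch per-OSD latency and disk utilisation while the benchmark runs:

  ceph osd perf                         # per-OSD commit/apply latency; look for one OSD spiking
  ceph daemon osd.2 dump_historic_ops   # recent slow ops recorded by a suspect OSD, run on the host carrying it
  iostat -x 1                           # disk utilisation/await on the OSD nodes during the 0-IOPS window

If one OSD's latency, or its disk's await/util in iostat, climbs while the others stay flat during the stall, that would point at the drive or controller rather than the network.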
On Mon, Apr 30, 2018 at 6:34 PM, Paul Emmerich <paul.emmerich@xxxxxxxx> wrote:
> Hi,
>
> Blocked requests are just requests that took longer than 30 seconds to
> complete; this usually means your cluster is completely overloaded by the
> benchmark.
> Also, OSD 2 might be slower than your other OSDs.
>
> Paul
>
> 2018-04-30 15:36 GMT+02:00 Shantur Rathore <shantur.rathore@xxxxxxxxx>:
>>
>> Hi all,
>>
>> I am trying to get my first test Ceph cluster working.
>>
>> CentOS 7 with the ELRepo 4.16.3-1.el7.elrepo.x86_64 kernel (for iSCSI HA)
>> Configured using ceph-ansible
>> 3 mons (including the 2 OSD nodes)
>> 2 OSD nodes
>> 20 OSDs (10 per node)
>>
>> Each OSD node has 256GB of memory and a 2x10GbE bonded interface.
>> For simplicity it uses the public network only.
>>
>> When testing the cluster from one of the OSD nodes, I see slow/blocked
>> requests on both nodes whenever I run a test; these clear up after some time.
>>
>> I have checked the disks and the network and both are working as expected.
>> I have been reading and trying to find ways to see what the issue could be,
>> but I am unable to find a fix or solution to the problem.
>>
>> # Test command
>> [root@storage-29 ~]# rbd bench --io-type write -p test --image disk1 --io-pattern seq --io-size 4K --io-total 10G
>>
>> In this run I saw in "ceph health detail" that osd.2 had blocked
>> requests, so I ran:
>>
>> [root@storage-29 ~]# ceph daemon osd.2 dump_blocked_ops
>> ..
>>
>> Last op from the output:
>>
>> {
>>     "description": "osd_op(client.181675.0:933 6.e1 6:8736f1d3:::rbd_data.20d9674b0dc51.00000000000006b7:head [write 1150976~4096] snapc 0=[] ondisk+write+known_if_redirected e434)",
>>     "initiated_at": "2018-04-30 14:04:37.656717",
>>     "age": 79.228713,
>>     "duration": 79.230355,
>>     "type_data": {
>>         "flag_point": "waiting for sub ops",
>>         "client_info": {
>>             "client": "client.181675",
>>             "client_addr": "10.187.21.212:0/342865484",
>>             "tid": 933
>>         },
>>         "events": [
>>             { "time": "2018-04-30 14:04:37.656717", "event": "initiated" },
>>             { "time": "2018-04-30 14:04:37.656789", "event": "queued_for_pg" },
>>             { "time": "2018-04-30 14:04:37.656869", "event": "reached_pg" },
>>             { "time": "2018-04-30 14:04:37.656917", "event": "started" },
>>             { "time": "2018-04-30 14:04:37.656970", "event": "waiting for subops from 10" },
>>             { "time": "2018-04-30 14:04:37.669473", "event": "op_commit" },
>>             { "time": "2018-04-30 14:04:37.669475", "event": "op_applied" }
>>         ]
>>     }
>> }
>>
>> I then checked the logs of osd.10, the OSD the op was waiting on:
>>
>> [root@storage-30 ~]# tail -n 1000 /var/log/ceph/ceph-osd.10.log
>>
>> Around the time of the blocked op, nothing is printed in the logs:
>>
>> 2018-04-30 13:34:59.986731 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.211:6818/344034 conn(0x55b79db85000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 56 vs existing csq=55 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:00.992309 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560 conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 9 vs existing csq=9 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:00.992711 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560 conn(0x55b79e3fd000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing csq=9 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:01.328882 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497 conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 7 vs existing csq=7 existing_state=STATE_STANDBY
>> 2018-04-30 13:35:01.329066 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6821/94497 conn(0x55b79e288000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 8 vs existing csq=7 existing_state=STATE_STANDBY
>> 2018-04-30 14:16:30.622506 7fa906bdc700 0 log_channel(cluster) log [DBG] : 6.396 scrub starts
>> 2018-04-30 14:16:30.640450 7fa906bdc700 0 log_channel(cluster) log [DBG] : 6.396 scrub ok
>> 2018-04-30 14:19:32.000798 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6801/93293 conn(0x55b79e106000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 10 vs existing csq=9 existing_state=STATE_OPEN
>> 2018-04-30 14:19:32.030075 7fa91d3c5700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6825/94560 conn(0x55b7a120b800 :-1 s=STATE_OPEN pgs=157 cs=11 l=0).fault initiating reconnect
>> 2018-04-30 14:19:58.492777 7fa91c3c3700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.212:6829/94852 conn(0x55b79e4a1000 :-1 s=STATE_OPEN pgs=153 cs=9 l=0).fault initiating reconnect
>> 2018-04-30 14:19:59.265081 7fa91c3c3700 0 -- 10.187.21.211:6810/343380 >> 10.187.21.211:6838/345694 conn(0x55b79dba1000 :6810 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg accept connect_seq 67 vs existing csq=67 existing_state=STATE_STANDBY
>>
>> I am not sure what I am missing:
>> - Maybe the request packet got lost
>> - Maybe I haven't enabled proper logging for the OSDs
>>
>> Please suggest how I can go about debugging this problem.
>>
>> Regards,
>> Shantur
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
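On the "proper logging" point in the quoted mail: OSD logging is usually raised temporarily at runtime rather than left at the defaults. A rough sketch (assuming a Luminous-era cluster; osd.10 is just the replica OSD from the log excerpt above, and the debug levels are arbitrary examples):

  ceph tell osd.* injectargs '--debug_osd 10 --debug_ms 1'     # raise OSD and messenger logging on every OSD
  ceph daemon osd.10 config set debug_osd 10                   # or on a single OSD, via its admin socket on that host
  ceph tell osd.* injectargs '--debug_osd 1/5 --debug_ms 0/5'  # back to roughly the defaults once a stalled run is captured

With debug_ms raised, the per-message lines should show whether the sub-op actually reached the replica OSD and when its reply came back, which would help tell a network drop apart from a slow disk.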