It looks like osd.4 may actually be the problem. Can you try removing
osd.4 and trying again?
-Sam

On Mon, Sep 2, 2013 at 8:01 AM, Mariusz Gronczewski
<mariusz.gronczewski@xxxxxxxxxxxxx> wrote:
> We've installed Ceph on a test cluster:
> 3x mon, 7xOSD on 2x10k RPM SAS
> CentOS 6.4 ( 2.6.32-358.14.1.el6.x86_64 )
> ceph 0.67.2 (also tried with 0.61.7 with same results)
>
> And during rados bench I get very strange behaviour:
> # rados bench -p pbench 100 write
>
>   sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
>   ...
>    51       16     1503      1487   116.603        72  0.306585  0.524611
>    52       16     1525      1509   116.053        88  0.171904  0.520352
>    53       16     1541      1525    115.07        64  0.121784  0.516466
>    54       16     1541      1525   112.939         0         -  0.516466
>    55       16     1541      1525   110.885         0         -  0.516466
>    56       16     1541      1525   108.905         0         -  0.516466
>    57       16     1541      1525   106.994         0         -  0.516466
>   ... ( http://pastebin.com/vV50YBVK )
>
> Bandwidth (MB/sec):     81.760
>
> Stddev Bandwidth:       53.8371
> Max bandwidth (MB/sec): 156
> Min bandwidth (MB/sec): 0
> Average Latency:        0.782271
> Stddev Latency:         2.51829
> Max latency:            26.1715
> Min latency:            0.084654
>
> Basically, the benchmark runs at full disk speed and then all I/O stops for 10+ seconds.
>
> During that time all I/O and CPU load on all nodes basically stops, and ceph -w starts to report:
>
> 2013-09-02 16:44:57.794115 osd.4 [WRN] 6 slow requests, 1 included below; oldest blocked for > 62.953663 secs
> 2013-09-02 16:44:57.794125 osd.4 [WRN] slow request 60.363101 seconds old, received at 2013-09-02 16:43:57.430961: osd_op(client.381797.0:2109 benchmark_data_hqblade203.non.3dart.com_18829_object2108 [write 0~4194304] 14.745012c3 e277) v4 currently waiting for subops from [0]
> 2013-09-02 16:45:01.795211 osd.4 [WRN] 6 slow requests, 1 included below; oldest blocked for > 66.954773 secs
> 2013-09-02 16:45:01.795221 osd.4 [WRN] slow request 60.661060 seconds old, received at 2013-09-02 16:44:01.134112: osd_op(client.381797.0:2199 benchmark_data_hqblade203.non.3dart.com_18829_object2198 [write 0~4194304] 14.dec41e60 e277) v4 currently waiting for subops from [0]
> 2013-09-02 16:45:02.795582 osd.4 [WRN] 6 slow requests, 2 included below; oldest blocked for > 67.955102 secs
> 2013-09-02 16:45:02.795590 osd.4 [WRN] slow request 60.316291 seconds old, received at 2013-09-02 16:44:02.479210: osd_op(client.381797.0:2230 benchmark_data_hqblade203.non.3dart.com_18829_object2229 [write 0~4194304] 14.b3ca5505 e277) v4 currently waiting for subops from [0]
> 2013-09-02 16:45:02.795595 osd.4 [WRN] slow request 60.014792 seconds old, received at 2013-09-02 16:44:02.780709: osd_op(client.381797.0:2234 benchmark_data_hqblade203.non.3dart.com_18829_object2233 [write 0~4194304] 14.a8c8cfd5 e277) v4 currently waiting for subops from [0]
> 2013-09-02 16:45:03.723742 osd.0 [WRN] 10 slow requests, 1 included below; oldest blocked for > 69.571037 secs
> 2013-09-02 16:45:03.723748 osd.0 [WRN] slow request 60.871583 seconds old, received at 2013-09-02 16:44:02.852110: osd_op(client.381797.0:2235 benchmark_data_hqblade203.non.3dart.com_18829_object2234 [write 0~4194304] 14.d44b2ab6 e277) v4 currently waiting for subops from [4]
>
> Any ideas why this is happening and how it can be debugged? It seems that there is something wrong with osd.0, but there doesn't seem to be anything wrong with the machine itself (bonnie++ and dd on the machine do not show any lockups).
>
> --
> Mariusz Gronczewski, Administrator
>
> Efigence Sp. z o. o.
> ul. Wołoska 9a, 02-583 Warszawa
> T: [+48] 22 380 13 13
> F: [+48] 22 380 13 14
> E: mariusz.gronczewski@xxxxxxxxxxxx
> <mailto:mariusz.gronczewski@xxxxxxxxxxxx>
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
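
If it helps with narrowing this down, here is a rough sketch for tallying the "slow request" warnings quoted above: which OSD reported each one, and which OSD it was waiting on for subops. The regular expression is based only on the line format shown in this thread, and the function name `tally_slow_requests` and the `SAMPLE` list are made up for illustration. The counts alone won't prove which OSD is at fault, but they do show that osd.4 and osd.0 keep waiting on each other, which is why taking one of them out is a reasonable test.

```python
import re
from collections import Counter

# Pattern assumed from the "slow request" lines quoted in this thread:
# group 1 is the reporting OSD, group 2 the list of OSDs it is waiting on.
SLOW_RE = re.compile(
    r"(osd\.\d+) \[WRN\] slow request .* waiting for subops from \[([\d,\s]+)\]"
)


def tally_slow_requests(lines):
    """Count slow-request warnings per reporting OSD and per awaited OSD."""
    reporters = Counter()
    waited_on = Counter()
    for line in lines:
        match = SLOW_RE.search(line)
        if match is None:
            continue  # summary lines and unrelated output are skipped
        reporters[match.group(1)] += 1
        for osd_id in match.group(2).split(","):
            waited_on["osd." + osd_id.strip()] += 1
    return reporters, waited_on


# A few of the lines from the report above, copied as-is.
SAMPLE = [
    "2013-09-02 16:44:57.794125 osd.4 [WRN] slow request 60.363101 seconds old, "
    "received at 2013-09-02 16:43:57.430961: osd_op(client.381797.0:2109 "
    "benchmark_data_hqblade203.non.3dart.com_18829_object2108 [write 0~4194304] "
    "14.745012c3 e277) v4 currently waiting for subops from [0]",
    "2013-09-02 16:45:01.795221 osd.4 [WRN] slow request 60.661060 seconds old, "
    "received at 2013-09-02 16:44:01.134112: osd_op(client.381797.0:2199 "
    "benchmark_data_hqblade203.non.3dart.com_18829_object2198 [write 0~4194304] "
    "14.dec41e60 e277) v4 currently waiting for subops from [0]",
    "2013-09-02 16:45:03.723742 osd.0 [WRN] 10 slow requests, 1 included below; "
    "oldest blocked for > 69.571037 secs",
    "2013-09-02 16:45:03.723748 osd.0 [WRN] slow request 60.871583 seconds old, "
    "received at 2013-09-02 16:44:02.852110: osd_op(client.381797.0:2235 "
    "benchmark_data_hqblade203.non.3dart.com_18829_object2234 [write 0~4194304] "
    "14.d44b2ab6 e277) v4 currently waiting for subops from [4]",
]

if __name__ == "__main__":
    reporters, waited_on = tally_slow_requests(SAMPLE)
    print("reported by:", dict(reporters))
    print("waiting on: ", dict(waited_on))
```

To run it against live output, the saved `ceph -w` output could be passed in line by line instead of `SAMPLE`.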