We've installed Ceph on a test cluster:

3x mon, 7x OSD on 2x 10k RPM SAS
CentOS 6.4 (2.6.32-358.14.1.el6.x86_64)
ceph 0.67.2 (also tried with 0.61.7, with the same results)

During rados bench I get very strange behaviour:

# rados bench -p pbench 100 write

  sec  Cur ops  started  finished  avg MB/s  cur MB/s  last lat   avg lat
...
   51       16     1503      1487   116.603        72  0.306585  0.524611
   52       16     1525      1509   116.053        88  0.171904  0.520352
   53       16     1541      1525    115.07        64  0.121784  0.516466
   54       16     1541      1525   112.939         0         -  0.516466
   55       16     1541      1525   110.885         0         -  0.516466
   56       16     1541      1525   108.905         0         -  0.516466
   57       16     1541      1525   106.994         0         -  0.516466
...
( http://pastebin.com/vV50YBVK )

Bandwidth (MB/sec):      81.760
Stddev Bandwidth:        53.8371
Max bandwidth (MB/sec):  156
Min bandwidth (MB/sec):  0
Average Latency:         0.782271
Stddev Latency:          2.51829
Max latency:             26.1715
Min latency:             0.084654

Basically, the benchmark runs at full disk speed and then all I/O stops for 10+ seconds.

During that time, I/O and CPU load on all nodes basically stop, and ceph -w starts to report:

2013-09-02 16:44:57.794115 osd.4 [WRN] 6 slow requests, 1 included below; oldest blocked for > 62.953663 secs
2013-09-02 16:44:57.794125 osd.4 [WRN] slow request 60.363101 seconds old, received at 2013-09-02 16:43:57.430961: osd_op(client.381797.0:2109 benchmark_data_hqblade203.non.3dart.com_18829_object2108 [write 0~4194304] 14.745012c3 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:01.795211 osd.4 [WRN] 6 slow requests, 1 included below; oldest blocked for > 66.954773 secs
2013-09-02 16:45:01.795221 osd.4 [WRN] slow request 60.661060 seconds old, received at 2013-09-02 16:44:01.134112: osd_op(client.381797.0:2199 benchmark_data_hqblade203.non.3dart.com_18829_object2198 [write 0~4194304] 14.dec41e60 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:02.795582 osd.4 [WRN] 6 slow requests, 2 included below; oldest blocked for > 67.955102 secs
2013-09-02 16:45:02.795590 osd.4 [WRN] slow request 60.316291 seconds old, received at 2013-09-02 16:44:02.479210: osd_op(client.381797.0:2230 benchmark_data_hqblade203.non.3dart.com_18829_object2229 [write 0~4194304] 14.b3ca5505 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:02.795595 osd.4 [WRN] slow request 60.014792 seconds old, received at 2013-09-02 16:44:02.780709: osd_op(client.381797.0:2234 benchmark_data_hqblade203.non.3dart.com_18829_object2233 [write 0~4194304] 14.a8c8cfd5 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:03.723742 osd.0 [WRN] 10 slow requests, 1 included below; oldest blocked for > 69.571037 secs
2013-09-02 16:45:03.723748 osd.0 [WRN] slow request 60.871583 seconds old, received at 2013-09-02 16:44:02.852110: osd_op(client.381797.0:2235 benchmark_data_hqblade203.non.3dart.com_18829_object2234 [write 0~4194304] 14.d44b2ab6 e277) v4 currently waiting for subops from [4]

Any ideas why this is happening and how it can be debugged? It seems that there is something wrong with osd.0, but there doesn't seem to be anything wrong with the machine itself (bonnie++ and dd on the machine do not show any lockups).

-- 
Mariusz Gronczewski, Administrator

Efigence Sp. z o. o.
ul. Wołoska 9a, 02-583 Warszawa
T: [+48] 22 380 13 13
F: [+48] 22 380 13 14
E: mariusz.gronczewski@xxxxxxxxxxxx
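One way to back up the suspicion about osd.0 is to tally which OSD the slow requests are waiting on. Below is a minimal sketch in Python, assuming each warning from ceph -w sits on a single line in the format quoted above. The names tally_blocking_osds and SAMPLE are hypothetical and are not part of any Ceph tooling; SAMPLE holds three of the warnings from this mail.

```python
import re
from collections import Counter

# Matches the reporting OSD and the bracketed peer list of one slow-request
# warning. Assumption: one warning per line, in the format quoted in this mail.
SLOW_RE = re.compile(
    r"(osd\.\d+) \[WRN\] slow request .*?"
    r"currently waiting for subops from \[([\d,\s]+)\]"
)

# Three warnings copied from the ceph -w output in this mail.
SAMPLE = """\
2013-09-02 16:44:57.794125 osd.4 [WRN] slow request 60.363101 seconds old, received at 2013-09-02 16:43:57.430961: osd_op(client.381797.0:2109 benchmark_data_hqblade203.non.3dart.com_18829_object2108 [write 0~4194304] 14.745012c3 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:01.795221 osd.4 [WRN] slow request 60.661060 seconds old, received at 2013-09-02 16:44:01.134112: osd_op(client.381797.0:2199 benchmark_data_hqblade203.non.3dart.com_18829_object2198 [write 0~4194304] 14.dec41e60 e277) v4 currently waiting for subops from [0]
2013-09-02 16:45:03.723748 osd.0 [WRN] slow request 60.871583 seconds old, received at 2013-09-02 16:44:02.852110: osd_op(client.381797.0:2235 benchmark_data_hqblade203.non.3dart.com_18829_object2234 [write 0~4194304] 14.d44b2ab6 e277) v4 currently waiting for subops from [4]
"""


def tally_blocking_osds(log_text):
    """Count how often each OSD appears as the peer a slow request waits on."""
    waited_on = Counter()
    for _reporter, peers in SLOW_RE.findall(log_text):
        for peer in peers.split(","):
            waited_on["osd." + peer.strip()] += 1
    return waited_on


if __name__ == "__main__":
    for osd, count in tally_blocking_osds(SAMPLE).most_common():
        print(osd, count)
```

Fed the full ceph -w output captured during a stall, an OSD that dominates the tally is the first one whose disks, journal and network links I would check.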
_______________________________________________ ceph-users mailing list ceph-users@xxxxxxxxxxxxxx http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com