Hello,

In addition to what Somnath wrote: if you're seeing this kind of blocking read _and_ have slow write warnings in the logs, your cluster is likely unhealthy and/or underpowered for its current load.

If your cluster is healthy, you may want to investigate what is busy; my guess would be the OSDs/HDDs. Also, any scrubs may drag your performance down, especially deep-scrubs (a few commands for a quick check are inline below).

However, this doesn't really explain any difference between Juno and Havana, as both should suffer equally from a sickly Ceph cluster.

Christian

On Wed, 13 May 2015 06:51:56 +0000 Somnath Roy wrote:

> Can you give some more insight into the Ceph cluster you are running?
> It seems IO started and then there is no response; cur MB/s is becoming 0.
> What is the 'ceph -s' output?
> Hope all the OSDs are up and running.
>
> Thanks & Regards
> Somnath
>
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of changqian zuo
> Sent: Tuesday, May 12, 2015 9:00 PM
> To: ceph-users@xxxxxxxxxxxxxx
> Subject: How to debug a ceph read performance problem?
>
> Hi, guys,
>
> We have been running an OpenStack Havana environment with Ceph 0.72.2 as
> the block storage backend. Recently we have been trying to upgrade OpenStack
> to Juno. For testing, we deployed a Juno all-in-one node; this node shares
> the same Cinder volume rbd pool and Glance image rbd pool with the old
> Havana environment.
>
> After some testing, we found a serious read performance problem with the
> Juno client (writes are just fine), something like:
>
> # rados bench -p test 30 seq
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>     0       0         0         0         0         0         -         0
>     1      16       100        84   335.843       336  0.020221 0.0393582
>     2      16       100        84   167.944         0         - 0.0393582
>     3      16       100        84   111.967         0         - 0.0393582
>     4      16       100        84   83.9769         0         - 0.0393582
>     5      16       100        84   67.1826         0         - 0.0393582
>     6      16       100        84   55.9863         0         - 0.0393582
>     7      16       100        84   47.9886         0         - 0.0393582
>     8      16       100        84   41.9905         0         - 0.0393582
>     9      16       100        84   37.3249         0         - 0.0393582
>    10      16       100        84   33.5926         0         - 0.0393582
>    11      16       100        84   30.5388         0         - 0.0393582
>    12      16       100        84   27.9938         0         - 0.0393582
>    13      16       100        84   25.8405         0         - 0.0393582
>    14      16       100        84   23.9948         0         - 0.0393582
>    15      16       100        84   22.3952         0         - 0.0393582
>
> And when testing an RBD image with fio (bs=512k read), there are entries
> like this in the client log:
>
> # grep 12067 ceph.client.log | grep read
> 2015-05-11 16:19:36.649554 7ff9949d5a00  1 -- 10.10.11.15:0/2012449 -->
> 10.10.11.21:6835/45746 -- osd_op(client.3772684.0:12067
> rbd_data.262a6e7bf17801.0000000000000003 [sparse-read 2621440~524288]
> 7.c43a3ae3 e240302) v4 -- ?+0 0x7ff9967c5fb0 con 0x7ff99a41c420
> 2015-05-11 16:20:07.709915 7ff94bfff700  1 -- 10.10.11.15:0/2012449
> <== osd.218 10.10.11.21:6835/45746 111 ==== osd_op_reply(12067
> rbd_data.262a6e7bf17801.0000000000000003 [sparse-read 2621440~524288]
> v0'0 uv3803266 ondisk = 0) v6 ==== 199+0+524312 (3484234903 0 0)
> 0x7ff3a4002ba0 con 0x7ff99a41c420
>
> Some operations take more than a minute.
>
> I checked the OSD logs (at the default logging level; ceph.com says that
> when a request takes too long, the OSD will complain in its log) and do see
> some slow 4k write requests, but no reads.
>
> We have tested Giant, Firefly, and self-built Emperor clients, with the
> same sad results.
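To check the things I mentioned above, something along these lines should do.
These are rough examples only; adjust OSD IDs and log paths to your
environment (osd.218 is simply the one from your client log above, and the
log location assumes a default package install):

# ceph -s
# ceph health detail
# ceph osd perf
# ceph pg dump | grep -c scrubbing

"ceph osd perf" (if your version already has it) lists per-OSD commit/apply
latencies, so a slow disk tends to stand out, and the pg dump grep tells you
whether any PGs are currently (deep-)scrubbing. Then, on the OSD nodes
themselves:

# iostat -x 5
# grep "slow request" /var/log/ceph/ceph-osd.*.log
# ceph daemon osd.218 dump_historic_ops

iostat (from sysstat) will show you whether the OSD disks are pegged (%util
near 100, high await), the grep turns up any slow request warnings, and
dump_historic_ops (run via the admin socket on the node that actually hosts
osd.218) shows where recent slow ops spent their time.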
>
> The network between the OSDs and the all-in-one node is 10Gb; this is from
> the client to an OSD node:
>
> # iperf3 -c 10.10.11.25 -t 60 -i 1
> Connecting to host 10.10.11.25, port 5201
> [  4] local 10.10.11.15 port 41202 connected to 10.10.11.25 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec  1.09 GBytes  9.32 Gbits/sec   11   2.02 MBytes
> [  4]   1.00-2.00   sec  1.09 GBytes  9.35 Gbits/sec   34   1.53 MBytes
> [  4]   2.00-3.00   sec  1.09 GBytes  9.35 Gbits/sec   11   1.14 MBytes
> [  4]   3.00-4.00   sec  1.09 GBytes  9.37 Gbits/sec    0   1.22 MBytes
> [  4]   4.00-5.00   sec  1.09 GBytes  9.34 Gbits/sec    0   1.27 MBytes
>
> and from an OSD node to the client (there may be some problem with the
> client's interface bonding, as 10Gb cannot be reached):
>
> # iperf3 -c 10.10.11.15 -t 60 -i 1
> Connecting to host 10.10.11.15, port 5201
> [  4] local 10.10.11.25 port 43934 connected to 10.10.11.15 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec   400 MBytes  3.35 Gbits/sec    1    337 KBytes
> [  4]   1.00-2.00   sec   553 MBytes  4.63 Gbits/sec    1    341 KBytes
> [  4]   2.00-3.00   sec   390 MBytes  3.27 Gbits/sec    1    342 KBytes
> [  4]   3.00-4.00   sec   395 MBytes  3.32 Gbits/sec    0    342 KBytes
> [  4]   4.00-5.00   sec   541 MBytes  4.54 Gbits/sec    0    346 KBytes
> [  4]   5.00-6.00   sec   405 MBytes  3.40 Gbits/sec    0    358 KBytes
> [  4]   6.00-7.00   sec   728 MBytes  6.11 Gbits/sec    1    370 KBytes
> [  4]   7.00-8.00   sec   741 MBytes  6.22 Gbits/sec    0    355 KBytes
>
> The Ceph cluster is shared by this Juno node and the old Havana environment
> (as mentioned, they use exactly the same rbd pools), and IO on Havana is
> just fine. Any suggestions or advice, so that we can determine whether this
> is an issue with the client, the network, or the Ceph cluster, and then
> move on? I am new to Ceph and need some help.
>
> Thanks

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/