Re: How to debug a ceph read performance problem?

Hello,

On Wed, 13 May 2015 18:09:46 +0800 changqian zuo wrote:

> Thanks for your time,
> 
> I wrote a Python script to re-analyse the client messaging log, listing each
> OSD IP and the time taken by the read requests sent to it, and found that the
> slow replies all come from node 10.10.11.12. So I ran a network test, and
> there is a problem:
> 
> # iperf3 -c 10.10.11.15 -t 60 -i 1
> Connecting to host 10.10.11.15, port 5201
> [  4] local 10.10.11.12 port 53944 connected to 10.10.11.15 port 5201
> [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> [  4]   0.00-1.00   sec   123 KBytes  1.01 Mbits/sec   20   4.24 KBytes
> [  4]   1.00-2.00   sec  67.9 KBytes   556 Kbits/sec   16   2.83 KBytes
> [  4]   2.00-3.00   sec   192 KBytes  1.58 Mbits/sec   23   4.24 KBytes
> [  4]   3.00-4.00   sec  22.6 KBytes   185 Kbits/sec   11   4.24 KBytes
> [  4]   4.00-5.00   sec  76.4 KBytes   626 Kbits/sec   18   4.24 KBytes
> [  4]   5.00-6.00   sec  25.5 KBytes   209 Kbits/sec   10   4.24 KBytes
> [  4]   6.00-7.00   sec   107 KBytes   880 Kbits/sec   15   2.83 KBytes
> [  4]   7.00-8.00   sec   127 KBytes  1.04 Mbits/sec   23   2.83 KBytes
> [  4]   8.00-9.00   sec   130 KBytes  1.07 Mbits/sec   17   4.24 KBytes
> [  4]   9.00-10.00  sec   119 KBytes   973 Kbits/sec   19   4.24 KBytes
> 
> I asked ops to look into this yesterday; it seems the fix is not completely
> done yet. I am still waiting for them to sort it out before running another
> test. This may explain why the IO rate starts high and then suddenly drops,
> but there may still be other issues.
> 
Yes, the moment Ceph needs to write to OSDs on that storage node things
will become unhappy.
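
For reference, here is a minimal, untested sketch of the kind of per-OSD
read-latency matching you describe: it pairs each osd_op request line with its
osd_op_reply by op id in a debug_ms = 1 client log (the format of the lines
you pasted further down) and prints the per-OSD average and worst latency.
The regexes are assumptions on my side, so adapt them to your actual log:

import re
import sys
from collections import defaultdict
from datetime import datetime

# "--> <osd addr> -- osd_op(client.X.Y:<opid> ..." marks a request going out
REQ = re.compile(r'^(\S+ \S+) .* --> (\S+) -- osd_op\(client\.\d+\.\d+:(\d+) ')
# "<== osd.N <osd addr> ... osd_op_reply(<opid> ..." marks the reply coming in
REP = re.compile(r'^(\S+ \S+) .* <== osd\.\d+ (\S+) .* osd_op_reply\((\d+) ')

def ts(s):
    return datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

pending = {}                    # op id -> (send time, target OSD address)
per_osd = defaultdict(list)     # target OSD address -> latencies in seconds

for line in sys.stdin:
    m = REQ.search(line)
    if m:
        pending[m.group(3)] = (ts(m.group(1)), m.group(2))
        continue
    m = REP.search(line)
    if m and m.group(3) in pending:
        sent, addr = pending.pop(m.group(3))
        per_osd[addr].append((ts(m.group(1)) - sent).total_seconds())

for addr, lats in sorted(per_osd.items(), key=lambda kv: -max(kv[1])):
    print('%-24s ops=%-6d avg=%.3fs max=%.3fs' %
          (addr, len(lats), sum(lats) / len(lats), max(lats)))

Feed it the client log on stdin; the OSD address with the worst maximum
latency is printed first.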

> Our cluster has not been very well maintained; "ceph -s" shows:
> 
> [root@controller fio-rbd]# ceph -s
>     cluster 6d2bb752-db69-48ff-9df4-7d85703f322e
>      health HEALTH_WARN
>      monmap e29: 5 mons at {bj-ceph09=
> 10.10.11.22:6789/0,bj-ceph10=10.10.11.23:6789/0,bj-ceph12=10.10.11.25:6789/0,bj-ceph13=10.10.11.26:6789/0,bj-ceph14=10.10.11.27:6789/0},
> election epoch 180736, quorum 0,1,2,3,4
> bj-ceph09,bj-ceph10,bj-ceph12,bj-ceph13,bj-ceph14
>      osdmap e240303: 216 osds: 216 up, 216 in
>       pgmap v31819901: 10048 pgs, 3 pools, 28274 GB data, 4736 kobjects
>             85880 GB used, 111 TB / 195 TB avail
>                10048 active+clean
>   client io 208 kB/s rd, 18808 kB/s wr, 1463 op/s
> 
> health is HEALTH_WARN, but I cannot tell what is wrong at first glance.
> 
Neither can I.
I am not sure whether that is an artifact of this oldish version of Ceph, or
whether something was wrong moments ago and the flag has not been cleared yet.
Going through ceph.log on one of your monitors, bj-ceph09 in particular, and
looking for WRN entries should be helpful.
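
Something as simple as the sketch below (hypothetical, and assuming the
cluster log warnings carry the usual "[WRN]" tag) tallies those entries by
message, so you get a quick overview of what the cluster has been complaining
about; feed it ceph.log on stdin:

import collections
import re
import sys

counts = collections.Counter()
for line in sys.stdin:
    if '[WRN]' not in line:
        continue
    # keep only the message after the [WRN] tag, fold digits together so
    # that e.g. per-OSD variants of the same warning are grouped
    msg = line.split('[WRN]', 1)[1].strip()
    counts[re.sub(r'\d+', 'N', msg)[:80]] += 1

for msg, n in counts.most_common(20):
    print('%6d  %s' % (n, msg))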

> "ceph osd tree" shows all OSD is up, but there are ceph node without any
> OSD up and running. 

Do these nodes actually exist, and are they supposed to have OSDs running on
them?
Because the osdmap at the time of your "ceph -s" was not missing any OSDs.
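
If you want to cross-check that quickly, a rough sketch like the one below
lists every host bucket in the CRUSH tree and how many of its OSDs are up.
The JSON field names ("nodes", "children", "status") are what I would expect
from "ceph osd tree -f json"; double-check them against what your version
actually emits:

import json
import subprocess

# Ask the cluster for the CRUSH tree in JSON form
out = subprocess.check_output(['ceph', 'osd', 'tree', '-f', 'json'])
tree = json.loads(out.decode('utf-8'))
nodes = {n['id']: n for n in tree['nodes']}

for node in tree['nodes']:
    if node.get('type') != 'host':
        continue
    osds = [nodes[c] for c in node.get('children', []) if c in nodes]
    up = sum(1 for o in osds if o.get('status') == 'up')
    print('%-20s %2d OSDs, %2d up' % (node['name'], len(osds), up))

A host that shows up here with OSDs but none of them up would explain what
you are seeing in "ceph osd tree".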


> And I do see that memory on many Ceph storage nodes is almost or completely
> used up.
> 
Define "run out". Is the memory actually used by processes?
How much memory and how many OSDs per storage node do you have?
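
To answer that quickly on one of the storage nodes, a small helper along these
lines (hypothetical, it just reads /proc directly) shows whether the memory is
actually pinned as ceph-osd RSS or merely sitting in the page cache, which the
kernel will happily give back under pressure:

import glob

def meminfo():
    out = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, val = line.split(':', 1)
            out[key] = int(val.split()[0])          # values are in kB
    return out

osd_rss_kb = 0
osd_count = 0
for path in glob.glob('/proc/[0-9]*/status'):
    try:
        with open(path) as f:
            fields = dict(l.split(':', 1) for l in f if ':' in l)
    except IOError:                                  # process exited meanwhile
        continue
    if fields.get('Name', '').strip() == 'ceph-osd':
        osd_count += 1
        osd_rss_kb += int(fields.get('VmRSS', '0 kB').split()[0])

mi = meminfo()
print('ceph-osd processes : %d' % osd_count)
print('ceph-osd total RSS : %.1f GB' % (osd_rss_kb / 1048576.0))
print('MemTotal / MemFree : %.1f / %.1f GB' %
      (mi['MemTotal'] / 1048576.0, mi['MemFree'] / 1048576.0))
print('page cache (Cached): %.1f GB' % (mi.get('Cached', 0) / 1048576.0))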

Christian

> I think I will work on these issues soon.
> 
> 
> 
> 
> 2015-05-13 15:17 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> 
> >
> > Hello,
> >
> > in addition to what Somnath wrote, if you're seeing this kind of
> > blocking read _and_ have slow write warnings in the logs, your
> > cluster is likely unhealthy and/or underpowered for its
> > current load.
> >
> > If your cluster is healthy, you may want to investigate what's busy; my
> > guess is the OSDs/HDDs.
> > Also, any scrubs may drag your performance down, especially deep-scrubs.
> >
> > However, this doesn't really explain any difference between Juno and
> > Havana, as both should suffer from a sickly Ceph cluster.
> >
> > Christian
> >
> > On Wed, 13 May 2015 06:51:56 +0000 Somnath Roy wrote:
> >
> > > Can you give some more insight into the ceph cluster you are
> > > running? It seems IO started and then there was no response.. cur MB/s
> > > keeps showing 0s.. What is 'ceph -s' output?
> > > Hope all the OSDs are up and running..
> > >
> > > Thanks & Regards
> > > Somnath
> > >
> > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of changqian zuo
> > > Sent: Tuesday, May 12, 2015 9:00 PM
> > > To: ceph-users@xxxxxxxxxxxxxx
> > > Subject: How to debug a ceph read performance problem?
> > >
> > > Hi, guys,
> > >
> > > We have been running an OpenStack Havana environment with Ceph
> > > 0.72.2 as the block storage backend. Recently we have been trying to
> > > upgrade OpenStack to Juno. For testing, we deployed a Juno all-in-one
> > > node; it shares the same Cinder volume rbd pool and Glance image rbd
> > > pool with the old Havana environment.
> > >
> > > And after some tests, we found a serious read performance problem with
> > > the Juno client (writes are just fine), something like:
> > >
> > > # rados bench -p test 30 seq
> > >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> > >      0       0         0         0         0         0         -         0
> > >      1      16       100        84   335.843       336  0.020221 0.0393582
> > >      2      16       100        84   167.944         0         - 0.0393582
> > >      3      16       100        84   111.967         0         - 0.0393582
> > >      4      16       100        84   83.9769         0         - 0.0393582
> > >      5      16       100        84   67.1826         0         - 0.0393582
> > >      6      16       100        84   55.9863         0         - 0.0393582
> > >      7      16       100        84   47.9886         0         - 0.0393582
> > >      8      16       100        84   41.9905         0         - 0.0393582
> > >      9      16       100        84   37.3249         0         - 0.0393582
> > >     10      16       100        84   33.5926         0         - 0.0393582
> > >     11      16       100        84   30.5388         0         - 0.0393582
> > >     12      16       100        84   27.9938         0         - 0.0393582
> > >     13      16       100        84   25.8405         0         - 0.0393582
> > >     14      16       100        84   23.9948         0         - 0.0393582
> > >     15      16       100        84   22.3952         0         - 0.0393582
> > >
> > > And when testing an RBD image with fio (bs=512k read), we see things
> > > like:
> > >
> > > # grep 12067 ceph.client.log | grep read
> > > 2015-05-11 16:19:36.649554 7ff9949d5a00  1 -- 10.10.11.15:0/2012449 -->
> > > 10.10.11.21:6835/45746 -- osd_op(client.3772684.0:12067
> > > rbd_data.262a6e7bf17801.0000000000000003 [sparse-read 2621440~524288]
> > > 7.c43a3ae3 e240302) v4 -- ?+0 0x7ff9967c5fb0 con 0x7ff99a41c420
> > > 2015-05-11 16:20:07.709915 7ff94bfff700  1 -- 10.10.11.15:0/2012449 <==
> > > osd.218 10.10.11.21:6835/45746 111 ==== osd_op_reply(12067
> > > rbd_data.262a6e7bf17801.0000000000000003 [sparse-read 2621440~524288]
> > > v0'0 uv3803266 ondisk = 0) v6 ==== 199+0+524312 (3484234903 0 0)
> > > 0x7ff3a4002ba0 con 0x7ff99a41c420
> > >
> > > Some operations take more than a minute.
> > >
> > > I checked the OSD logs (at the default logging level; ceph.com says
> > > that when a request takes too long, the OSD will complain in its log)
> > > and do see some slow 4k write requests, but no slow reads.
> > >
> > > We have tested Giant, Firefly, and self-built Emperor clients, with the
> > > same sad results.
> > >
> > > The network between the OSDs and the all-in-one node is 10Gb. This is
> > > from the client to an OSD node:
> > >
> > > # iperf3 -c 10.10.11.25 -t 60 -i 1
> > > Connecting to host 10.10.11.25, port 5201
> > > [  4] local 10.10.11.15 port 41202 connected to 10.10.11.25 port 5201
> > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > [  4]   0.00-1.00   sec  1.09 GBytes  9.32 Gbits/sec   11   2.02 MBytes
> > > [  4]   1.00-2.00   sec  1.09 GBytes  9.35 Gbits/sec   34   1.53 MBytes
> > > [  4]   2.00-3.00   sec  1.09 GBytes  9.35 Gbits/sec   11   1.14 MBytes
> > > [  4]   3.00-4.00   sec  1.09 GBytes  9.37 Gbits/sec    0   1.22 MBytes
> > > [  4]   4.00-5.00   sec  1.09 GBytes  9.34 Gbits/sec    0   1.27 MBytes
> > >
> > > and from an OSD node to the client (there may be some problem with the
> > > client's interface bonding; 10Gb could not be reached):
> > >
> > > # iperf3 -c 10.10.11.15 -t 60 -i 1
> > > Connecting to host 10.10.11.15, port 5201
> > > [  4] local 10.10.11.25 port 43934 connected to 10.10.11.15 port 5201
> > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > [  4]   0.00-1.00   sec   400 MBytes  3.35 Gbits/sec    1    337 KBytes
> > > [  4]   1.00-2.00   sec   553 MBytes  4.63 Gbits/sec    1    341 KBytes
> > > [  4]   2.00-3.00   sec   390 MBytes  3.27 Gbits/sec    1    342 KBytes
> > > [  4]   3.00-4.00   sec   395 MBytes  3.32 Gbits/sec    0    342 KBytes
> > > [  4]   4.00-5.00   sec   541 MBytes  4.54 Gbits/sec    0    346 KBytes
> > > [  4]   5.00-6.00   sec   405 MBytes  3.40 Gbits/sec    0    358 KBytes
> > > [  4]   6.00-7.00   sec   728 MBytes  6.11 Gbits/sec    1    370 KBytes
> > > [  4]   7.00-8.00   sec   741 MBytes  6.22 Gbits/sec    0    355 KBytes
> > >
> > > The Ceph cluster is shared by this Juno node and the old Havana
> > > environment (as mentioned, they use exactly the same rbd pools), and IO
> > > on Havana is just fine. Any suggestions or advice, so that we can work
> > > out whether this is an issue with the client, the network, or the ceph
> > > cluster, and then go on? I am new to Ceph and need some help.
> > >
> > > Thanks
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
> >


-- 
Christian Balzer        Network/Systems Engineer                
chibi@xxxxxxx   	Global OnLine Japan/Fusion Communications
http://www.gol.com/
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




