Hello,

On Thu, 14 May 2015 14:32:11 +0800 changqian zuo wrote:

> Hi,
>
> 1. The network problem has been partly resolved; we removed the bonding
> on the Juno node (Ceph client side), and now IO comes back:
>
So the problem was not (really) with the Ceph cluster at all.

> [root@controller fio-rbd]# rados bench -p test 30 seq
>    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
>      0       0         0         0         0         0         -         0
>      1      16       176       160     639.7       640  0.186673 0.0933836
>      2      16       339       323   645.795       652  0.079945 0.0965533
>      3      16       509       493   657.153       680   0.06882 0.0957288
>      4      16       672       656   655.837       652  0.068071 0.0963944
>      5      16       828       812    649.45       624  0.061999 0.0975488
>      6      16       989       973   648.513       644  0.110632 0.0979637
>      7      16      1139      1123   641.565       600  0.078144 0.0983299
>      8      16      1295      1279   639.349       624  0.243684 0.0991592
>      9      16      1453      1437   638.522       632   0.08775 0.0993148
>     10      16      1580      1564   625.461       508  0.061375  0.101921
>
> The bonding is constructed from interfaces em1 and em2; the problematic
> interface is em2. Traffic from some storage nodes to em2 is fine, but
> from others it is not. We still don't know the exact issue at the
> moment, but it is surely a network problem.
>
> 2. About monitors:
> - the monitors have not been restarted for at least half a year.
> - "ceph tell mon.bj-ceph14 compact" just got stuck until I hit Ctrl+C,
>   the same for the other monitor nodes.
>
That's odd, I wonder if it just takes a _loong_ time to do that compaction
and if a restart would be faster (if compact on start is set, of course).

> - /var/lib/ceph/mon shares the Linux system disk (RAID1 of two HDDs)
>
That's a software RAID1, not HW RAID1 with cache?
Ceph, especially "older" versions, likes the MON leveldb to be on fast
storage. I've seen a MON drop out due to insufficient IOPS during a MD
RAID1 rebuild on top of slowish SATA drives.

> I will go through Google and the mailing list later.
>
> 3. About memory: yes, I got that wrong. Will spend some time on "atop"
> :-)
>
> 4. Single CPU with 4 cores, without hyperthreading.
>
> So, the CPU needs to be upgraded, the OSD number per Ceph node should be
> reduced (to spare some CPU power for more SSDs), and more SSD journal
> disks should be added.
>
You will want to look with atop during normal and stress test (rados
bench) times to determine what exactly your limits and bottlenecks are.

> Also, I am planning to upgrade the OSD data disks from 1TB to 4TB. I
> will look through the mails about the ratio between OSD data disks and
> journal disks, and the space and performance requirements for journal
> SSDs.
>
Keep in mind that more data per OSD will translate into a larger basic
memory requirement for the OSD processes and that of course a disk
failure will then also result in more data movement.
Going back to the example of one of your current journal SSDs failing:
with half full 1TB disks, that's 6TB of data to re-balance; with half
full 4TB disks, that's 24TB...

The idea here is to strike a balance between density (which saves money
for servers, infrastructure and rack space) and scaling out as much as
possible (many storage nodes with few, small HDDs), which is where Ceph
excels.

Christian

> 2015-05-14 10:39 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
>
> > Hello,
> >
> > On Thu, 14 May 2015 09:36:14 +0800 changqian zuo wrote:
> >
> > > 1. No packet drop found in system log.
> > >
> > Is that storage node with the "bad" network fixed?
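
As an aside, syslog is not the only place where a flaky bonding leg like
em2 shows up; the per-interface error/drop counters in /proc/net/dev are
worth a quick look as well. A rough sketch along those lines (not from
this thread; it assumes the usual Linux column layout, so double-check the
fields on your kernel):

  #!/usr/bin/env python
  # Rough sketch: print RX/TX error and drop counters per interface from
  # /proc/net/dev, to spot a flaky bonding slave such as em2.
  # Assumes the standard /proc/net/dev column order; verify on your kernel.

  def read_net_dev(path="/proc/net/dev"):
      stats = {}
      with open(path) as f:
          for line in f.readlines()[2:]:        # skip the two header lines
              name, data = line.split(":", 1)
              fields = [int(x) for x in data.split()]
              # RX: bytes packets errs drop ...; TX fields start at index 8
              stats[name.strip()] = {"rx_errs": fields[2], "rx_drop": fields[3],
                                     "tx_errs": fields[10], "tx_drop": fields[11]}
      return stats

  if __name__ == "__main__":
      for iface, s in sorted(read_net_dev().items()):
          flag = "  <-- check me" if any(s.values()) else ""
          print("%-8s rx_errs=%d rx_drop=%d tx_errs=%d tx_drop=%d%s"
                % (iface, s["rx_errs"], s["rx_drop"],
                   s["tx_errs"], s["tx_drop"], flag))
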
> > > 2. "ceph health detail" shows:
> > >
> > > # ceph health detail
> > > HEALTH_WARN
> > > mon.bj-ceph10 addr 10.10.11.23:6789/0 has 43% avail disk space --
> > >   store is getting too big! 77364 MB >= 40960 MB
> > > mon.bj-ceph12 addr 10.10.11.25:6789/0 has 43% avail disk space --
> > >   store is getting too big! 77071 MB >= 40960 MB
> > > mon.bj-ceph13 addr 10.10.11.26:6789/0 has 42% avail disk space --
> > >   store is getting too big! 78403 MB >= 40960 MB
> > > mon.bj-ceph14 addr 10.10.11.27:6789/0 has 43% avail disk space --
> > >   store is getting too big! 78006 MB >= 40960 MB
> > >
> > > I am checking out what this means exactly.
> > >
> > You will find a lot of answers looking for "compact mon storage",
> > including a very recent thread here.
> > In short, I suppose those monitors have not been re-started for a long
> > time, right?
> > Also, you have a pretty big cluster, so this isn't all that surprising.
> >
> > I'd suggest to do a "ceph tell mon.bj-ceph14 compact" and if that
> > works out well, repeat with the others.
> > Are your MONs using SSDs, for /var/lib/ceph in particular?
> >
> > > 3. By "run out", I mean:
> > >
> > > # free -m
> > >              total       used       free     shared    buffers     cached
> > > Mem:         64376      63766        609          0        123      47974
> > > -/+ buffers/cache:      15669      48707
> > > Swap:        22831       2319      20512
> > >
> > That doesn't look too bad, only ~16GB used by processes, the rest is
> > cache and friends. However during recovery this usage will get higher
> > and Ceph also benefits from large pagecaches for hot object reads.
> > So doubling the memory might be a good long term goal.
> >
> > > "top" shows memory is mainly used by the ceph-osd processes.
> > >
> > You might want to spend some time learning "atop", as it will show you
> > what is going on in every part of your system (huge terminal window
> > helps).
> >
> > > 4. Cluster configuration, for a single Ceph node:
> > >
> > > CPU: Intel(R) Xeon(R) CPU E5-2603 v2 @ 1.80GHz
> > >
> > Single CPU???
> > That's optimistically 8GHz of CPU, when the recommendation for purely
> > HDD based OSDs is 1GHz per OSD. Since you're using SSD journals, you
> > will want double that to not be CPU limited for many use cases.
> >
> > > Memory: 64GB
> > > Data disk: 1TB HDD * 24 (do not know vendor now)
> > > Journal disk: 800TB SSD * 2 (do not know vendor now)
> > >
> > > We have run 24 OSDs on one node! I think this is why memory is in
> > > shortage (also the CPU may not cope with high-load recovery or
> > > rebalance, and 2 SSDs for 24 OSD journals is just not enough) and
> > > slow OSD writes were logged; if reduced to 16 or 12, it would be
> > > much better.
> > >
> > You're likely to be CPU bound a lot of times during normal operations
> > (provided everything else in your cluster is working correctly).
> > Depending on the type of SSD they should be fast enough to saturate
> > your 10Gb/s network.
> > However you seem to have 9 storage nodes and losing 12 OSDs when one
> > of these SSDs should fail is a pretty scary prospect.
> >
> > > 5. About "ceph osd tree". It seems some nodes served before, but
> > > lately the OSDs on them have all been removed:
> > >
> > > ......
> > > 156     0.91            osd.156  up      1
> > > 157     0.91            osd.157  up      1
> > > 158     0.91            osd.158  up      1
> > > -16     0          host bj-ceph20
> > > -17     0          host bj-ceph15
> > > -18     0          host bj-ceph11
> > > -19     21.84      host bj-ceph08
> > > 219     0.91            osd.219  up      1
> > > 220     0.91            osd.220  up      1
> > > 221     0.91            osd.221  up      1
> > > 208     0.91            osd.208  up      1
> > > 207     0.91            osd.207  up      1
> > > 209     0.91            osd.209  up      1
> > > 215     0.91            osd.215  up      1
> > > 211     0.91            osd.211  up      1
> > > .....
> > >
> > Well, that's nothing to worry about now.
> > You could clean that up later, but removing those hosts will cause data
> > movement due to CRUSH changes, thus leaving it as it is might be just
> > as well.
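
To put rough numbers on the CPU and journal-SSD points above, the
arithmetic fits in a few lines. The figures below are the ones from this
thread (24 OSDs and 2 journal SSDs per node, 4 cores at 1.8GHz); the
per-OSD GHz values are only the rules of thumb quoted above, not hard
requirements:

  #!/usr/bin/env python
  # Back-of-the-envelope sketch for the sizing discussion above.
  # All numbers are taken from this thread; adjust to your own hardware.

  osds_per_node = 24
  journal_ssds  = 2
  disk_size_tb  = 1.0      # current HDDs; try 4.0 for the planned upgrade
  fill_ratio    = 0.5      # "half full" in the example above
  cores, clock  = 4, 1.8   # E5-2603 v2
  ghz_per_osd   = 2.0      # ~1 GHz/OSD for pure HDD, ~2 with SSD journals

  osds_per_ssd = osds_per_node // journal_ssds
  rebalance_tb = osds_per_ssd * disk_size_tb * fill_ratio

  print("One journal SSD failure takes out %d OSDs" % osds_per_ssd)
  print("  -> roughly %.1f TB of data to re-balance" % rebalance_tb)
  print("CPU budget: %.1f GHz available vs ~%.1f GHz recommended"
        % (cores * clock, osds_per_node * ghz_per_osd))

With 1TB disks this prints the 6TB re-balance figure mentioned above; with
4TB disks it becomes 24TB, which is the trade-off being described.
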
> > > Seems I have a lot of work to do...
> > >
> > Indeed...
> >
> > Christian
> >
> > > 2015-05-13 22:13 GMT+08:00 Somnath Roy <Somnath.Roy@xxxxxxxxxxx>:
> > >
> > > > If ceph -s or ceph osd tree is showing everything as up but in
> > > > reality it is not, I think the heartbeat traffic is getting lost,
> > > > maybe due to the network problem you indicated.
> > > >
> > > > Also, the fact that your node memory is running out probably means
> > > > it is trying to do recovery. OSDs are flapping.
> > > >
> > > > Give us the following information.
> > > >
> > > > 1. Are you seeing any packet drops in the system log?
> > > >
> > > > 2. Also, how many hosts are serving these OSDs? Host CPU/memory?
> > > >
> > > > 3. Is this an all-HDD cluster?
> > > >
> > > > 4. Yeah, it is showing a health warning, but not saying anything!
> > > > Run 'ceph health detail' to see if it gives extra details or not.
> > > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > *From:* changqian zuo [mailto:dummyhacker85@xxxxxxxxx]
> > > > *Sent:* Wednesday, May 13, 2015 3:10 AM
> > > > *To:* Christian Balzer; Somnath Roy
> > > > *Cc:* ceph-users@xxxxxxxxxxxxxx
> > > > *Subject:* Re: How to debug a ceph read performance problem?
> > > >
> > > > Thanks for your time,
> > > >
> > > > I wrote a Python script to re-analyse the client messaging log,
> > > > listing every OSD IP and the time taken by each read request to
> > > > that OSD, and found that the slow replies all come from node
> > > > 10.10.11.12. So I did a network test, and there is a problem:
> > > >
> > > > # iperf3 -c 10.10.11.15 -t 60 -i 1
> > > > Connecting to host 10.10.11.15, port 5201
> > > > [  4] local 10.10.11.12 port 53944 connected to 10.10.11.15 port 5201
> > > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > > [  4]   0.00-1.00   sec   123 KBytes  1.01 Mbits/sec   20   4.24 KBytes
> > > > [  4]   1.00-2.00   sec  67.9 KBytes   556 Kbits/sec   16   2.83 KBytes
> > > > [  4]   2.00-3.00   sec   192 KBytes  1.58 Mbits/sec   23   4.24 KBytes
> > > > [  4]   3.00-4.00   sec  22.6 KBytes   185 Kbits/sec   11   4.24 KBytes
> > > > [  4]   4.00-5.00   sec  76.4 KBytes   626 Kbits/sec   18   4.24 KBytes
> > > > [  4]   5.00-6.00   sec  25.5 KBytes   209 Kbits/sec   10   4.24 KBytes
> > > > [  4]   6.00-7.00   sec   107 KBytes   880 Kbits/sec   15   2.83 KBytes
> > > > [  4]   7.00-8.00   sec   127 KBytes  1.04 Mbits/sec   23   2.83 KBytes
> > > > [  4]   8.00-9.00   sec   130 KBytes  1.07 Mbits/sec   17   4.24 KBytes
> > > > [  4]   9.00-10.00  sec   119 KBytes   973 Kbits/sec   19   4.24 KBytes
> > > >
> > > > I asked ops to tweak this yesterday, but it seems this is not
> > > > completely done yet. I am still waiting for them to handle it and
> > > > will then do another test. This may explain why the IO rate starts
> > > > high and suddenly goes down, but there may still be other issues.
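
For anyone wanting to redo that kind of log analysis, here is a minimal
sketch of the idea. It is not the original script; the regular expressions
assume the "debug ms = 1" osd_op/osd_op_reply lines quoted further down in
this thread, and the log path is passed on the command line:

  #!/usr/bin/env python
  # Rough sketch: pair osd_op requests with their osd_op_reply by tid in a
  # client log and report per-OSD op latency (pre-filter with grep for
  # reads only if desired).  Usage: python osd_latency.py ceph.client.log
  import re
  import sys
  from collections import defaultdict
  from datetime import datetime

  REQ = re.compile(r'^(\S+ \S+) .*?--> (\S+) -- osd_op\(client\.\d+\.\d+:(\d+) ')
  REP = re.compile(r'^(\S+ \S+) .*?<== (osd\.\d+) (\S+) .*? osd_op_reply\((\d+) ')

  def ts(s):
      return datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f")

  pending = {}                   # tid -> (send time, OSD address)
  per_osd = defaultdict(list)    # "osd.N addr" -> [latencies in seconds]

  for line in open(sys.argv[1]):
      m = REQ.search(line)
      if m:
          pending[m.group(3)] = (ts(m.group(1)), m.group(2))
          continue
      m = REP.search(line)
      if m and m.group(4) in pending:
          sent, addr = pending.pop(m.group(4))
          lat = (ts(m.group(1)) - sent).total_seconds()
          per_osd["%s %s" % (m.group(2), addr)].append(lat)

  for osd, lats in sorted(per_osd.items(), key=lambda kv: -max(kv[1])):
      print("%-30s ops=%5d avg=%8.3fs max=%8.3fs"
            % (osd, len(lats), sum(lats) / len(lats), max(lats)))
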
> > > > Our cluster has not been very well maintained; "ceph -s" shows:
> > > >
> > > > [root@controller fio-rbd]# ceph -s
> > > >     cluster 6d2bb752-db69-48ff-9df4-7d85703f322e
> > > >      health HEALTH_WARN
> > > >      monmap e29: 5 mons at {bj-ceph09=10.10.11.22:6789/0,
> > > >        bj-ceph10=10.10.11.23:6789/0,bj-ceph12=10.10.11.25:6789/0,
> > > >        bj-ceph13=10.10.11.26:6789/0,bj-ceph14=10.10.11.27:6789/0},
> > > >        election epoch 180736, quorum 0,1,2,3,4
> > > >        bj-ceph09,bj-ceph10,bj-ceph12,bj-ceph13,bj-ceph14
> > > >      osdmap e240303: 216 osds: 216 up, 216 in
> > > >       pgmap v31819901: 10048 pgs, 3 pools, 28274 GB data, 4736 kobjects
> > > >             85880 GB used, 111 TB / 195 TB avail
> > > >                10048 active+clean
> > > >   client io 208 kB/s rd, 18808 kB/s wr, 1463 op/s
> > > >
> > > > Health is HEALTH_WARN, but I cannot tell what is wrong at first
> > > > glance.
> > > >
> > > > "ceph osd tree" shows all OSDs as up, but there are Ceph nodes
> > > > without any OSD up and running. And I do see that memory on many
> > > > Ceph storage nodes is almost or completely used up.
> > > >
> > > > I think I will work on these issues soon.
> > > >
> > > > 2015-05-13 15:17 GMT+08:00 Christian Balzer <chibi@xxxxxxx>:
> > > >
> > > > Hello,
> > > >
> > > > in addition to what Somnath wrote, if you're seeing this kind of
> > > > blocking reads _and_ have slow write warnings in the logs, your
> > > > cluster is likely either unhealthy and/or underpowered for its
> > > > current load.
> > > >
> > > > If your cluster is healthy, you may want to investigate what's
> > > > busy; my guess is the OSDs/HDDs.
> > > > Also any scrubs may drag your performance down, especially
> > > > deep-scrubs.
> > > >
> > > > However this doesn't really explain any difference between Juno and
> > > > Havana, as both should suffer from a sickly Ceph cluster.
> > > >
> > > > Christian
> > > >
> > > > On Wed, 13 May 2015 06:51:56 +0000 Somnath Roy wrote:
> > > >
> > > > > Can you give some more insight about the ceph cluster you are
> > > > > running? It seems IO started and then there was no response...
> > > > > cur MB/s is becoming 0s. What is 'ceph -s' output?
> > > > > Hope all the OSDs are up and running.
> > > > >
> > > > > Thanks & Regards
> > > > > Somnath
> > > > >
> > > > > From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On
> > > > > Behalf Of changqian zuo
> > > > > Sent: Tuesday, May 12, 2015 9:00 PM
> > > > > To: ceph-users@xxxxxxxxxxxxxx
> > > > > Subject: How to debug a ceph read performance problem?
> > > > >
> > > > > Hi, guys,
> > > > >
> > > > > We have been running an OpenStack Havana environment with Ceph
> > > > > 0.72.2 as block storage backend. Recently we were trying to
> > > > > upgrade OpenStack to Juno. For testing, we deployed a Juno
> > > > > all-in-one node; this node shares the same Cinder volume rbd pool
> > > > > and Glance image rbd pool with the old Havana.
> > > > >
> > > > > And after some tests, we found a serious read performance
> > > > > problem on the Juno client (write is just OK), something like:
> > > > >
> > > > > # rados bench -p test 30 seq
> > > > >    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
> > > > >      0       0         0         0         0         0         -         0
> > > > >      1      16       100        84   335.843       336  0.020221 0.0393582
> > > > >      2      16       100        84   167.944         0         - 0.0393582
> > > > >      3      16       100        84   111.967         0         - 0.0393582
> > > > >      4      16       100        84   83.9769         0         - 0.0393582
> > > > >      5      16       100        84   67.1826         0         - 0.0393582
> > > > >      6      16       100        84   55.9863         0         - 0.0393582
> > > > >      7      16       100        84   47.9886         0         - 0.0393582
> > > > >      8      16       100        84   41.9905         0         - 0.0393582
> > > > >      9      16       100        84   37.3249         0         - 0.0393582
> > > > >     10      16       100        84   33.5926         0         - 0.0393582
> > > > >     11      16       100        84   30.5388         0         - 0.0393582
> > > > >     12      16       100        84   27.9938         0         - 0.0393582
> > > > >     13      16       100        84   25.8405         0         - 0.0393582
> > > > >     14      16       100        84   23.9948         0         - 0.0393582
> > > > >     15      16       100        84   22.3952         0         - 0.0393582
> > > > >
> > > > > And when testing an RBD image with fio (bs=512k read), there are
> > > > > client log entries like:
> > > > >
> > > > > # grep 12067 ceph.client.log | grep read
> > > > > 2015-05-11 16:19:36.649554 7ff9949d5a00  1 -- 10.10.11.15:0/2012449 -->
> > > > >   10.10.11.21:6835/45746 -- osd_op(client.3772684.0:12067
> > > > >   rbd_data.262a6e7bf17801.0000000000000003 [sparse-read 2621440~524288]
> > > > >   7.c43a3ae3 e240302) v4 -- ?+0 0x7ff9967c5fb0 con 0x7ff99a41c420
> > > > > 2015-05-11 16:20:07.709915 7ff94bfff700  1 -- 10.10.11.15:0/2012449 <==
> > > > >   osd.218 10.10.11.21:6835/45746 111 ==== osd_op_reply(12067
> > > > >   rbd_data.262a6e7bf17801.0000000000000003 [sparse-read 2621440~524288]
> > > > >   v0'0 uv3803266 ondisk = 0) v6 ==== 199+0+524312 (3484234903 0 0)
> > > > >   0x7ff3a4002ba0 con 0x7ff99a41c420
> > > > >
> > > > > Some operations take more than a minute.
> > > > >
> > > > > I checked the OSD logs (default logging level; ceph.com says an
> > > > > OSD will complain in its log when a request takes too long) and
> > > > > do see some slow 4k write requests, but no reads.
> > > > >
> > > > > We have tested Giant, Firefly, and a self-built Emperor client,
> > > > > with the same sad results.
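
As a side note on reading those osd_op lines: with RBD format 2 images the
object name suffix is just the object number in hex, so a byte offset in
the image can be mapped back to the rados object with a single division. A
small sketch, assuming the default 4 MB (order 22) object size; check
"rbd info <image>" for the real order and block_name_prefix:

  #!/usr/bin/env python
  # Sketch: map a byte offset inside an RBD image to the rados object that
  # backs it, so osd_op lines like the ones above can be tied back to an
  # offset in the fio job.  Assumes the default 4 MB (order 22) objects.

  def rbd_object_for(offset, block_name_prefix, order=22):
      obj_size = 1 << order                    # 4 MiB by default
      obj_no = offset // obj_size
      return "%s.%016x" % (block_name_prefix, obj_no), offset % obj_size

  if __name__ == "__main__":
      # The request quoted above: object ...0003, sparse-read 2621440~524288,
      # i.e. image offset 3 * 4 MiB + 2621440.
      name, off = rbd_object_for(3 * (1 << 22) + 2621440,
                                 "rbd_data.262a6e7bf17801")
      print("%s, offset %d within the object" % (name, off))
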
> > > > >
> > > > > The network between the OSDs and the all-in-one node is a 10Gb
> > > > > network; this is from client to OSD:
> > > > >
> > > > > # iperf3 -c 10.10.11.25 -t 60 -i 1
> > > > > Connecting to host 10.10.11.25, port 5201
> > > > > [  4] local 10.10.11.15 port 41202 connected to 10.10.11.25 port 5201
> > > > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > > > [  4]   0.00-1.00   sec  1.09 GBytes  9.32 Gbits/sec   11   2.02 MBytes
> > > > > [  4]   1.00-2.00   sec  1.09 GBytes  9.35 Gbits/sec   34   1.53 MBytes
> > > > > [  4]   2.00-3.00   sec  1.09 GBytes  9.35 Gbits/sec   11   1.14 MBytes
> > > > > [  4]   3.00-4.00   sec  1.09 GBytes  9.37 Gbits/sec    0   1.22 MBytes
> > > > > [  4]   4.00-5.00   sec  1.09 GBytes  9.34 Gbits/sec    0   1.27 MBytes
> > > > >
> > > > > and OSD to client (there may be some problem with the client
> > > > > interface bonding; 10Gb could not be reached):
> > > > >
> > > > > # iperf3 -c 10.10.11.15 -t 60 -i 1
> > > > > Connecting to host 10.10.11.15, port 5201
> > > > > [  4] local 10.10.11.25 port 43934 connected to 10.10.11.15 port 5201
> > > > > [ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
> > > > > [  4]   0.00-1.00   sec   400 MBytes  3.35 Gbits/sec    1    337 KBytes
> > > > > [  4]   1.00-2.00   sec   553 MBytes  4.63 Gbits/sec    1    341 KBytes
> > > > > [  4]   2.00-3.00   sec   390 MBytes  3.27 Gbits/sec    1    342 KBytes
> > > > > [  4]   3.00-4.00   sec   395 MBytes  3.32 Gbits/sec    0    342 KBytes
> > > > > [  4]   4.00-5.00   sec   541 MBytes  4.54 Gbits/sec    0    346 KBytes
> > > > > [  4]   5.00-6.00   sec   405 MBytes  3.40 Gbits/sec    0    358 KBytes
> > > > > [  4]   6.00-7.00   sec   728 MBytes  6.11 Gbits/sec    1    370 KBytes
> > > > > [  4]   7.00-8.00   sec   741 MBytes  6.22 Gbits/sec    0    355 KBytes
> > > > >
> > > > > The Ceph cluster is shared by this Juno and the old Havana (as
> > > > > mentioned, they use exactly the same rbd pools), and IO on Havana
> > > > > goes just fine. Any suggestions or advice, so that we can make
> > > > > sure whether it is an issue of the client, the network, or the
> > > > > Ceph cluster, and then go on? I am new to Ceph and need some
> > > > > help.
> > > > >
> > > > > Thanks
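
On the bonding suspicion in that last iperf3 run: the kernel's bonding
driver exposes per-slave state in /proc/net/bonding/<bond>, which is a
quick way to see whether one leg (em2 in this thread) is down or flapping.
A rough sketch of reading it (the field labels are the usual ones written
by the bonding driver; verify them on your kernel):

  #!/usr/bin/env python
  # Rough sketch: summarise slave state from /proc/net/bonding/* to spot a
  # bad bonding leg.  Field labels are assumed to match the common bonding
  # driver output (Slave Interface / MII Status / Speed / Link Failure Count).
  import glob

  for path in glob.glob("/proc/net/bonding/*"):
      print("== %s ==" % path)
      slave = None
      for line in open(path):
          line = line.strip()
          if line.startswith("Slave Interface:"):
              slave = line.split(":", 1)[1].strip()
          elif slave and line.startswith(("MII Status:", "Speed:",
                                          "Link Failure Count:")):
              print("  %-6s %s" % (slave, line))
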
> > > > --
> > > > Christian Balzer        Network/Systems Engineer
> > > > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > > > http://www.gol.com/
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
> > http://www.gol.com/
>

--
Christian Balzer        Network/Systems Engineer
chibi@xxxxxxx           Global OnLine Japan/Fusion Communications
http://www.gol.com/

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com