Re: Pinpointing performance bottleneck / would SSD journals help?

On 27/06/2016 17:42, Daniel Schneller wrote:
> Hi!
>
> We are currently trying to pinpoint a bottleneck and are somewhat stuck.
>
> First things first, this is the hardware setup:
>
> 4x DELL PowerEdge R510, 12x4TB OSD HDDs, journal colocated on HDD
>   96GB RAM, 2x6 Cores + HT
> 2x1GbE bonded interfaces for Cluster Network
> 2x1GbE bonded interfaces for Public Network
> Ceph Hammer on Ubuntu 14.04
>
> 6 OpenStack Compute Nodes with all-RBD VMs (no ephemeral storage).
>
> The VMs run a variety of stuff, most notable MongoDB, Elasticsearch
> and our custom software which uses both the VM's virtual disks as
> well the Rados Gateway for Object Storage.
>
> Recently, under certain more write-intensive conditions we see read and
> overall system performance starting to suffer as well.
>
> Here is an iostat -x 3 sample for one of the VMs hosting MongoDB.
> Notice the "await" times (vda is the root, vdb is the data volume).
>
>
> Linux 3.13.0-35-generic (node02)     06/24/2016     _x86_64_    (16 CPU)
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           1.55    0.00    0.44    0.42    0.00   97.59
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda               0.00     0.91    0.09    1.01     2.55     9.59    22.12     0.01  266.90 2120.51   98.59   4.76   0.52
> vdb               0.00     1.53   18.39   40.79   405.98   483.92    30.07     0.30    5.68    5.42    5.80   3.96  23.43
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           5.05    0.00    2.08    3.16    0.00   89.71
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> vdb               0.00     7.00   23.00   29.00   368.00   500.00    33.38     1.91  446.00  422.26  464.83  19.08  99.20
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>           4.43    0.00    1.73    4.94    0.00   88.90
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> vda               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> vdb               0.00    13.00   45.00   83.00   712.00  1041.00    27.39     2.54 1383.25  272.18 1985.64   7.50  96.00
>
>
> If we read this right, the average time spent waiting for read or write
> requests to be serviced can be multi-second. This is in line with
> MongoDB's slow query log, where we see fully indexed queries, returning
> a single result, taking over a second where they would normally finish
> almost instantly.
>
> So far we have looked at these metrics (using StackExchange's Bosun
> from https://bosun.org). Most values are collected every 15 seconds.
>
> * Network Link saturation.
>  All links / bonds are well below any relevant load (around 35MB/s or
>  less)

Are you sure? On each server you have 12 OSDs, each with a theoretical
bandwidth of at least half of 100MB/s (100MB/s being the minimum for any
reasonable HDD, halved because the journal sits on the same device).
That means your total disk bandwidth per server is around 600MB/s.
Bonded links do not aggregate perfectly (depending on the mode, one
client will either always use the same link or have its traffic
imperfectly balanced between the two), so your effective network
bandwidth is probably closer to 1Gbps (~120MB/s).
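
To make the arithmetic explicit, here is a quick back-of-envelope sketch
(the 100MB/s per disk and the single-effective-link figure are rough
assumptions, not measurements):

    # Back-of-envelope sketch with assumed numbers, not measured values.
    # 12 OSD HDDs per server, ~100 MB/s sequential each, halved because the
    # journal sits on the same spindle (every client write hits the disk twice).
    osds_per_server = 12
    hdd_seq_mb_s = 100        # assumed per-disk sequential throughput
    journal_penalty = 0.5     # colocated journal -> double writes

    disk_bw = osds_per_server * hdd_seq_mb_s * journal_penalty
    print("usable disk bandwidth per server: ~%d MB/s" % disk_bw)   # ~600

    # Bonded 2x1GbE rarely gives a single flow more than one link's worth,
    # so assume roughly 1 Gbps of usable network bandwidth.
    net_bw = 1000 / 8.0       # 1 Gbps expressed in MB/s, ~125
    print("realistic network bandwidth per server: ~%d MB/s" % net_bw)
    print("disks can outrun the network by ~%.0fx" % (disk_bw / net_bw))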

What may be happening is that the 35MB/s is an average over a fairly
long period (several seconds); during short bursts it is probably
peaking at 120MB/s. I wouldn't use less than 10Gbps for both the cluster
and public networks in your case.
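
For example (a minimal sketch with made-up per-second numbers), a
15-second collection interval can report ~35MB/s even though the link
was saturated for several seconds inside that window:

    # Minimal sketch of how a long sampling window hides short bursts.
    # Assumed (made-up) pattern: 3 seconds at link saturation, 12 quiet
    # seconds, reported as one 15-second average the way Bosun would.
    per_second_mb_s = [120, 120, 120] + [14] * 12

    window_avg = sum(per_second_mb_s) / float(len(per_second_mb_s))
    print("15 s average:   ~%.0f MB/s" % window_avg)         # ~35, looks fine
    print("busiest second:  %d MB/s" % max(per_second_mb_s)) # link was full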

You didn't say how many VMs are running: the rkB/s and wkB/s seem very
low (note that for a write-intensive workload your VM is reading quite a
bit...), but if you have 10 or more VMs competing for read and write
access this way it wouldn't be unexpected. As soon as latency rises for
one reason or another (here it would be network latency), you can expect
the total throughput of random accesses to plummet.
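
To put numbers on that last point, a minimal Little's-law sketch (the
await values are the vdb figures from your iostat output; the fixed
count of 2 requests in flight is an assumption):

    # Little's law sketch: with a roughly constant number of requests in
    # flight, IOPS ~= in-flight requests / latency. The await values come
    # from the iostat samples above; the in-flight count of 2 is assumed.
    in_flight = 2.0

    for await_ms in (5.68, 446.00, 1383.25):
        iops = in_flight / (await_ms / 1000.0)
        print("await %8.2f ms -> ~%6.1f IOPS" % (await_ms, iops))
    # ~352 IOPS at 5.68 ms, ~4.5 IOPS at 446 ms, ~1.4 IOPS at 1383 ms:
    # two orders of magnitude more latency, two orders less throughput.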

If your cluster isn't already backfilling or deep scrubbing, you can
expect it to collapse under the additional load when it does (and it
will have to perform both at some point)...
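
As a rough illustration of what deep scrubbing alone could add (a sketch
assuming mostly full disks and a weekly deep-scrub cycle; real scrubs
are throttled and spread out, so treat this as an upper bound):

    # Rough sketch of the extra load a deep scrub cycle adds per server,
    # assuming mostly full disks and an (assumed) one-week deep-scrub
    # interval. Deep scrub has to re-read every object on every OSD.
    raw_tb_per_server = 12 * 4            # 12x 4TB OSDs
    scrub_interval_s = 7 * 24 * 3600      # assumed weekly cycle

    extra_read_mb_s = raw_tb_per_server * 1e6 / scrub_interval_s
    print("extra sustained reads per server: ~%.0f MB/s" % extra_read_mb_s)
    # ~79 MB/s on top of the client I/O shown above; this is mostly local
    # disk read load, while backfill additionally consumes the cluster network.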

Best regards,

Lionel
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



