Pinpointing performance bottleneck / would SSD journals help?

Hi!

We are currently trying to pinpoint a bottleneck and are somewhat stuck.

First things first, this is the hardware setup:

4x Dell PowerEdge R510, each with:
  12x 4TB OSD HDDs, journals colocated on the HDDs
  96GB RAM, 2x 6 cores + HT
  2x 1GbE bonded interfaces for the cluster network
  2x 1GbE bonded interfaces for the public network
Ceph Hammer on Ubuntu 14.04

6 OpenStack Compute Nodes with all-RBD VMs (no ephemeral storage).

The VMs run a variety of workloads, most notably MongoDB, Elasticsearch
and our custom software, which uses both the VMs' virtual disks as
well as the RADOS Gateway for object storage.

Recently, under certain more write-intensive conditions, we see read and
overall system performance starting to suffer as well.

Here is an iostat -x 3 sample for one of the VMs hosting MongoDB.
Notice the "await" times (vda is the root, vdb is the data volume).


Linux 3.13.0-35-generic (node02) 	06/24/2016 	_x86_64_	(16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          1.55    0.00    0.44    0.42    0.00   97.59

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await  svctm  %util
vda        0.00    0.91    0.09    1.01    2.55     9.59    22.12     0.01   266.90  2120.51    98.59   4.76   0.52
vdb        0.00    1.53   18.39   40.79  405.98   483.92    30.07     0.30     5.68     5.42     5.80   3.96  23.43

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          5.05    0.00    2.08    3.16    0.00   89.71

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await  svctm  %util
vda        0.00    0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00   0.00   0.00
vdb        0.00    7.00   23.00   29.00  368.00   500.00    33.38     1.91   446.00   422.26   464.83  19.08  99.20

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          4.43    0.00    1.73    4.94    0.00   88.90

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s    wkB/s avgrq-sz avgqu-sz    await  r_await  w_await  svctm  %util
vda        0.00    0.00    0.00    0.00    0.00     0.00     0.00     0.00     0.00     0.00     0.00   0.00   0.00
vdb        0.00   13.00   45.00   83.00  712.00  1041.00    27.39     2.54  1383.25   272.18  1985.64   7.50  96.00


If we read this right, the average time spent waiting for read or write
requests to be serviced can reach multiple seconds. This matches
MongoDB's slow query log, where we see fully indexed queries, returning
a single result, taking over a second when they would normally finish
almost instantly.
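
To cross-check these numbers from the Ceph side rather than from inside
the VM, one thing we intend to watch is the per-OSD commit/apply latency
reported by the cluster itself. A minimal sketch, assuming Hammer's
"ceph osd perf" output still lists fs_commit_latency(ms) and
fs_apply_latency(ms) per OSD:

  # per-OSD journal commit / filestore apply latency, worst OSDs last
  ceph osd perf | sort -n -k3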

So far we have looked at the following metrics, using Stack Exchange's
Bosun (https://bosun.org), which collects the raw data for these (and
lots of other) metrics every 15 seconds:

* Network link saturation
 All links/bonds are well below any relevant load (around 35 MB/s or
 less).

* Storage node RAM
 At least 3GB reported as "free", between 50GB and 70GB as cached.

* Storage node CPU
 Rarely above 30%.

* Number of I/Os in progress per OSD (as per /proc/diskstats, read out
 as shown below)
 These reach values of up to 180.
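
For reference, outside of Bosun this can be read straight from
/proc/diskstats: the ninth statistics field (column 12 of each line) is
"I/Os currently in progress". A quick sketch, assuming the OSD disks
show up as sd* on the storage nodes:

  # I/Os currently in progress per whole disk (column 12 of /proc/diskstats)
  awk '$3 ~ /^sd[a-z]+$/ { print $3, $12 }' /proc/diskstats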



We have a suspicion that the spinners are the culprit here, but to
verify this, and to be able to convince the upper layers of company
leadership to invest in some SSDs for the journals, we need better
evidence; quite apart from the personal desire to understand exactly
what is going on here :)
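
As a rough sanity check of that suspicion (explicitly assuming 3x
replication and something like 100-150 IOPS per 7.2k spindle, neither of
which is confirmed above): with the journals colocated, every replicated
write hits a spindle twice, so 48 spindles would give on the order of
48 * 125 / (3 * 2) = 1000 sustained client write IOPS for the whole
cluster, before any read traffic competes for the same heads.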

Regardless of the VMs on top (which could be any client, as I see it):
which metrics would I have to collect and look at to verify or reject
the assumption that we are limited by our pure HDD setup?
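
The candidates we are aware of so far (corrections very welcome) are
plain iostat -x on the storage nodes themselves, to see await/%util of
the individual OSD spinners, and the per-OSD perf counters exposed via
the admin socket, which on FileStore should include journal and op
latencies. Roughly like this, run on a storage node; the exact counter
names (journal_latency, op_w_latency, ...) are an assumption to be
checked against the actual dump:

  # dump osd.0's perf counters and pick out the latency-related ones
  ceph daemon osd.0 perf dump | python -m json.tool | grep -i -B1 -A3 latency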


Thanks a lot!

Daniel


--
Daniel Schneller
Principal Cloud Engineer

CenterDevice GmbH
https://www.centerdevice.de




