Hi everyone,
We have a Luminous cluster (12.2.10) on Ubuntu Xenial,
though we have
also observed the same behavior on 12.2.7 on Bionic
(download.ceph.com
doesn't build Luminous packages for Bionic, and 12.2.7
is the latest
distro build).
The primary use case for this cluster is radosgw. We have 6 OSD nodes
with 22 OSDs per node, of which 20 are SAS spinners and 2 are NVMe
devices. The cluster was deployed with ceph-ansible stable-3.1; we're
using "objectstore: bluestore" and "osd_scenario: collocated".
We're using a "class hdd" replicated CRUSH ruleset for
all our pools,
except:
- the bucket index pool, which uses a replicated
"class nvme" rule, and
- the bucket data pool, which uses an EC
(crush-device-class=hdd,
crush-failure-domain=host, k=3, m=2).
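Just for context, the device-class rules themselves were created with
the standard Luminous command, along these lines (the rule names here
are illustrative; ours may differ):

  ceph osd crush rule create-replicated replicated-hdd  default host hdd
  ceph osd crush rule create-replicated replicated-nvme default host nvme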
We also have 3 pools that we created so we can do benchmark runs while
leaving the other pools untouched (again, rough creation commands
follow after the list), so we have
- bench-repl-hdd, replicated, size 3, using a CRUSH
rule with "step take
default class hdd"
- bench-repl-nvme, replicated, size 3, using a CRUSH
rule with "step
take default class nvme"
- bench-ec-hdd, EC, crush-device-class=hdd,
crush-failure-domain=host,
k=3, m=2.
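Roughly, those benchmark pools were set up like this (PG counts and
the EC profile name are illustrative, not necessarily our exact
values):

  ceph osd pool create bench-repl-hdd  128 128 replicated replicated-hdd
  ceph osd pool create bench-repl-nvme 128 128 replicated replicated-nvme
  ceph osd erasure-code-profile set bench-ec32 \
      k=3 m=2 crush-device-class=hdd crush-failure-domain=host
  ceph osd pool create bench-ec-hdd 128 128 erasure bench-ec32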
Baseline benchmarks with "ceph tell osd.* bench" at the default block
size of 4M yield pretty much exactly the throughput you'd expect from
the devices: approx. 185 MB/s from the SAS drives. The NVMe devices
currently pull only 650 MB/s on writes, but that may well be due to
pending conditioning, as this is new hardware.
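In other words, roughly the following, where the explicit arguments
just spell out the defaults of 1 GiB written in 4 MiB blocks:

  ceph tell osd.* bench 1073741824 4194304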
Now when we run "rados bench" against the replicated
pools, we again get
exactly what we expect for a nominally performing but
largely untuned
system.
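Concretely, those runs look something like this (the 60-second
duration and 16 concurrent ops are just what we habitually use; rados
bench writes 4 MB objects by default):

  rados bench -p bench-repl-hdd  60 write -t 16 --no-cleanup
  rados bench -p bench-repl-nvme 60 write -t 16 --no-cleanup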
It's when we try running benchmarks against the EC
pool that everything
appears to grind to a halt:
http://paste.openstack.org/show/738187/
After 19 seconds, that pool stops accepting any further objects. We
simultaneously see slow request warnings creep up in the cluster, and
the only thing we can then do is kill the benchmark and wait for the
slow requests to clear out.
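The EC run in the paste above was invoked the same way as the
replicated runs, just against the EC pool (again, the exact duration
and concurrency aren't the point):

  rados bench -p bench-ec-hdd 60 write -t 16 --no-cleanup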
We've also seen the log messages discussed in
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028972.html,
and they seem to correlate with the slow requests
popping up, but from
Greg's reply in
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028974.html
I'm assuming that that's benign and doesn't warrant
further investigation.
Here are a few things we've tried, to no avail:
- Make sure we use the latest Luminous release (we
started out on Bionic
and 12.2.7, then reinstalled systems with Xenial so we
could use 12.2.10).
- Enable Bluestore buffered writes
(bluestore_default_buffered_write =
true); buffered reads are on by default.
- Extend the BlueStore cache from 1G to 4G (bluestore_cache_size_hdd =
4294967296; each OSD box has 128G RAM, so we should not run into
memory starvation issues with that).
But those were basically "let's give this a shot and
see if it makes a
difference" attempts (it didn't).
I'm basically looking for ideas on where to even start looking, so if
anyone can point us in the right direction, that would be excellent.
Thanks in advance for any help you can offer; it is
much appreciated!
Cheers,
Florian