Re: EC pools grinding to a screeching halt on Luminous

What is happening on the individual nodes when you reach that point
(iostat -x 1 on the OSD nodes)? Also, what throughput do you get when
benchmarking the replicated pool?
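For example, you could run a write benchmark against each of your bench
pools while watching the disks on every OSD node, and compare the two
runs (the pool names below are the ones from your setup; the 60-second
runtime is arbitrary):

iostat -x 1
rados bench -p bench-repl-hdd 60 write --no-cleanup
rados bench -p bench-ec-hdd 60 write --no-cleanup

If the spinners only hit 100% utilization with large await values
during the EC run, that already narrows things down.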

I guess one way to start would be by looking at ongoing operations at
the OSD level:

ceph daemon osd.X dump_blocked_ops
ceph daemon osd.X dump_ops_in_flight
ceph daemon osd.X dump_historic_slow_ops

(see "ceph daemon osd.X help" for more commands).
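If you want to collect those from every OSD on a node in one go, and
assuming the admin sockets are in the default location, a quick loop
does it:

for sock in /var/run/ceph/ceph-osd.*.asok; do
    echo "== $sock =="
    ceph daemon "$sock" dump_blocked_ops
done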

The first command shows operations that are currently blocked, the
second shows everything the OSD has in flight, and the last one shows
recently completed operations that were flagged as slow. Each entry
includes the operation's event history, so you can follow the flow of
individual operations; you might find that the slow operations are all
associated with the same few PGs, or that they spend most of their time
waiting on something (sub-ops from other OSDs, for example).
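If there are a lot of them, something like this can help spot a pattern
(assuming jq is available; the top-level key is "ops" or "Ops"
depending on the version, and for client osd_op entries the second
field of the description is the PG id):

ceph daemon osd.X dump_historic_slow_ops | \
    jq -r '(.ops // .Ops)[].description' | \
    awk '{print $2}' | sort | uniq -c | sort -rn

If a handful of PGs dominate that list, check which OSDs they map to
with "ceph pg map <pgid>".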

Hope that helps.

Mohamad


On 12/26/18 5:20 AM, Florian Haas wrote:
> Hi everyone,
>
> We have a Luminous cluster (12.2.10) on Ubuntu Xenial, though we have
> also observed the same behavior on 12.2.7 on Bionic (download.ceph.com
> doesn't build Luminous packages for Bionic, and 12.2.7 is the latest
> distro build).
>
> The primary use case for this cluster is radosgw. 6 OSD nodes, 22 OSDs
> per node, of which 20 are SAS spinners and 2 are NVMe devices. The
> cluster has been deployed with ceph-ansible stable-3.1; we're using
> "objectstore: bluestore" and "osd_scenario: collocated".
>
> We're using a "class hdd" replicated CRUSH ruleset for all our pools,
> except:
>
> - the bucket index pool, which uses a replicated "class nvme" rule, and
> - the bucket data pool, which is erasure-coded (crush-device-class=hdd,
> crush-failure-domain=host, k=3, m=2).
>
> We also have 3 pools that we created so that we can do benchmark runs
> while leaving the other pools untouched (rule and pool creation is
> sketched after this list), so we have
>
> - bench-repl-hdd, replicated, size 3, using a CRUSH rule with "step take
> default class hdd"
> - bench-repl-nvme, replicated, size 3, using a CRUSH rule with "step
> take default class nvme"
> - bench-ec-hdd, EC, crush-device-class=hdd, crush-failure-domain=host,
> k=3, m=2.
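>
> (Roughly, those were created like this; the profile name and PG counts
> below are placeholders rather than our exact values, and bench-repl-nvme
> was created analogously to bench-repl-hdd:
>
> ceph osd crush rule create-replicated bench-hdd default host hdd
> ceph osd crush rule create-replicated bench-nvme default host nvme
> ceph osd pool create bench-repl-hdd 64 64 replicated bench-hdd
> ceph osd erasure-code-profile set bench-ec-profile \
>     k=3 m=2 crush-failure-domain=host crush-device-class=hdd
> ceph osd pool create bench-ec-hdd 64 64 erasure bench-ec-profile
> )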
>
> Baseline benchmarks with "ceph tell osd.* bench" at the default block
> size of 4M yield pretty much exactly the throughput you'd expect from
> the devices: approx. 185 MB/s from the SAS drives; the NVMe devices
> currently pull only 650 MB/s on writes, but that may well be due to
> pending conditioning (this is new hardware).
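>
> (Spelled out with explicit arguments that's e.g. "ceph tell osd.0 bench
> 1073741824 4194304", i.e. write 1 GB in 4 MB blocks directly to the
> OSD's object store, bypassing the client I/O path.)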
>
> Now when we run "rados bench" against the replicated pools, we again get
> exactly what we expect for a nominally performing but largely untuned
> system.
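>
> (Those runs were along these lines; the exact runtime and thread count
> may have differed:
>
> rados bench -p bench-repl-hdd 60 write -t 16 --no-cleanup
> rados bench -p bench-repl-nvme 60 write -t 16 --no-cleanup
> )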
>
> It's when we try running benchmarks against the EC pool that everything
> appears to grind to a halt:
>
> http://paste.openstack.org/show/738187/
>
> After 19 seconds, that pool does not accept a single further object. We
> simultaneously see slow request warnings creep up in the cluster, and
> the only thing we can then do is kill the benchmark and wait for the
> slow requests to clear out.
>
> We've also seen the log messages discussed in
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028972.html,
> and they seem to correlate with the slow requests popping up, but from
> Greg's reply in
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028974.html
> I'm assuming that that's benign and doesn't warrant further investigation.
>
> Here are a few things we've tried, to no avail:
>
> - Make sure we use the latest Luminous release (we started out on Bionic
> and 12.2.7, then reinstalled systems with Xenial so we could use 12.2.10).
> - Enable Bluestore buffered writes (bluestore_default_buffered_write =
> true); buffered reads are on by default.
> - Extend the BlueStore cache from 1G to 4G (bluestore_cache_size_hdd =
> 4294967296; each OSD box has 128G RAM, so it should not run into memory
> starvation with that). See the ceph.conf snippet after this list for
> how we set the last two options.
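>
> (In ceph.conf terms, the last two of those were roughly:
>
> [osd]
> bluestore default buffered write = true
> bluestore cache size hdd = 4294967296
>
> followed by an OSD restart.)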
>
> But those were basically "let's give this a shot and see if it makes a
> difference" attempts (it didn't).
>
> I'm basically looking for ideas on where to even start looking. So if
> anyone can guide us in the right direction, that would be excellent.
> Thanks in advance for any help you can offer; it is much appreciated!
>
> Cheers,
> Florian

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



