Re: EC pools grinding to a screeching halt on Luminous

On 12/31/18 4:51 AM, Marcus Murwall wrote:
What you say does make sense, though, as I also get the feeling that the OSDs are just waiting for something. Something that never happens, until the requests finally time out...

So the OSDs are just completely idle? If not, try using strace and/or perf to get some insights into what they're doing.
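
For example, against one of the OSDs that has blocked requests (the PID below is a placeholder):

perf top -p <ceph-osd pid>
strace -c -f -p <ceph-osd pid>    # interrupt with Ctrl-C after ~30s for a syscall summary

If the OSD really is idle, you should mostly see its threads parked in futex/epoll waits.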

Maybe someone with better knowledge of EC internals will suggest something. In the meantime, you might want to look at the client side. Could the client be somehow saturated or blocked on something? (If the clients aren't blocked, you can use 'perf' or Mark's profiler [1] to profile them.)

Try benchmarking with an iodepth of 1 and slowly increase it until you run into the issue, all while monitoring your resources. You might find what causes the tipping point. Are you able to reproduce this using fio? Maybe this is just a client issue...
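
With rados bench you can approximate that ramp by stepping up the number of concurrent ops; the pool name and runtime below are just examples:

rados bench -p bench-ec-hdd 60 write -b 4M -t 1
rados bench -p bench-ec-hdd 60 write -b 4M -t 2
rados bench -p bench-ec-hdd 60 write -b 4M -t 4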

Sorry for suggesting a bunch of things that are all over the place; I'm just trying to understand the state of the cluster (and clients). Are both the OSDs and the clients completely blocked, making no progress?

Let us know what you find.

Mohamad

[1] https://github.com/markhpc/gdbpmp/
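
(If you try gdbpmp, the idea is to attach it to a stuck OSD, collect some samples and then print the call graph. Roughly like the following, though the flag names here are from memory, so double-check with gdbpmp.py --help; debug symbols for ceph-osd need to be installed.)

./gdbpmp.py -p <ceph-osd pid> -n 100 -o osd.gdbpmp
./gdbpmp.py -i osd.gdbpmp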


I will have one of our network guys take a look and get a second pair of eyes on it as well, just to make sure I'm not missing anything.

Thanks for your help so far, Mohamad; I really appreciate it. If you have some more ideas/suggestions on where to look, please let us know.

I wish you all a happy new year.

Regards
Marcus

28 December 2018 at 16:10
Hi Marcus,

On 12/27/18 4:21 PM, Marcus Murwall wrote:
Hey Mohamad

I work with Florian on this issue.
Just reinstalled the ceph cluster and triggered the error again.
Looking at iostat -x 1 there is basically no activity at all against any of the OSDs.
We get blocked ops all over the place, but here is some output from one of the OSDs that had blocked requests: http://paste.openstack.org/show/738721/

Looking at the historic_slow_ops, the step in the pipeline that takes the most time is sub_op_applied -> commit_sent. I couldn't say exactly what these steps are from a high level view, but looking at the code, commit_sent indicates that a message has been sent to the OSD's client over the network. Can you look for network congestion (the fact that there's nothing happening on the disks points in that direction too)? Something like iftop might help. Is there anything suspicious in the logs?
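
A few quick checks for that (interface name is a placeholder):

iftop -i <cluster-facing interface> -P    # per-connection bandwidth between the OSD hosts
sar -n DEV 1                              # per-NIC throughput over time
netstat -s | grep -i retrans              # growing retransmit counters point at the network
ip -s link show <interface>               # drops/errors on the NIC itself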

Also, do you get the same throughput when benchmarking the replicated pool as when benchmarking the EC pool?

Mohamad



Regards
Marcus

26 December 2018 at 18:27
What is happening on the individual nodes when you reach that point
(iostat -x 1 on the OSD nodes)? Also, what throughput do you get when
benchmarking the replicated pool?

I guess one way to start would be by looking at ongoing operations at
the OSD level:

ceph daemon osd.X dump_blocked_ops
ceph daemon osd.X dump_ops_in_flight
ceph daemon osd.X dump_historic_slow_ops

(See "ceph daemon osd.X help" for more commands.)

The first command shows currently blocked operations, the second shows
the operations currently in flight, and the last shows recent slow
operations. You can follow the flow of
individual operations, and you might find that the slow operations are
all associated with the same few PGs, or that they're spending too much
time waiting on something.
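
One way to check whether the slow ops cluster around a few PGs is to
pull the PG id out of each op description and count them. A rough
sketch, assuming jq is available and the usual JSON layout with an
"ops" array whose descriptions start with the client followed by the PG:

ceph daemon osd.X dump_historic_slow_ops | \
  jq -r '.ops[].description' | awk '{print $2}' | sort | uniq -c | sort -rn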

Hope that helps.

Mohamad


26 December 2018 at 11:20
Hi everyone,

We have a Luminous cluster (12.2.10) on Ubuntu Xenial, though we have
also observed the same behavior on 12.2.7 on Bionic (download.ceph.com
doesn't build Luminous packages for Bionic, and 12.2.7 is the latest
distro build).

The primary use case for this cluster is radosgw. 6 OSD nodes, 22 OSDs
per node, of which 20 are SAS spinners and 2 are NVMe devices. Cluster
has been deployed with ceph-ansible stable-3.1; we're using
"objectstore: bluestore" and "osd_scenario: collocated".

We're using a "class hdd" replicated CRUSH ruleset for all our pools,
except:

- the bucket index pool, which uses a replicated "class nvme" rule, and
- the bucket data pool, which is erasure coded (crush-device-class=hdd,
crush-failure-domain=host, k=3, m=2).

We have also created 3 pools so that we can run benchmarks while
leaving the other pools untouched:

- bench-repl-hdd, replicated, size 3, using a CRUSH rule with "step take
default class hdd"
- bench-repl-nvme, replicated, size 3, using a CRUSH rule with "step
take default class nvme"
- bench-ec-hdd, EC, crush-device-class=hdd, crush-failure-domain=host,
k=3, m=2.
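
For completeness, that kind of profile and pool can be created roughly
like so (the profile name and PG counts here are placeholders, not
necessarily what we used):

ceph osd erasure-code-profile set ec-3-2-hdd k=3 m=2 \
  crush-failure-domain=host crush-device-class=hdd
ceph osd pool create bench-ec-hdd 256 256 erasure ec-3-2-hdd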

Baseline benchmarks with "ceph tell osd.* bench" at the default block
size of 4M yield pretty much exactly the throughput you'd expect from the
devices: approx. 185 MB/s from the SAS drives; the NVMe devices
currently pull only 650 MB/s on writes but that may well be due to
pending conditioning — this is new hardware.
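
(For the record, that baseline is simply:

ceph tell osd.* bench                       # default 4M block size
ceph tell osd.12 bench 1073741824 4194304   # or explicitly: total bytes, block size

with osd.12 and the byte counts above being arbitrary examples.)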

Now when we run "rados bench" against the replicated pools, we again get
exactly what we expect for a nominally performing but largely untuned
system.
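
The invocations are plain rados bench writes, along the lines of
(60-second runtime shown just as an example):

rados bench -p bench-repl-hdd 60 write -b 4M --no-cleanup
rados bench -p bench-ec-hdd 60 write -b 4M --no-cleanup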

It's when we try running benchmarks against the EC pool that everything
appears to grind to a halt:

http://paste.openstack.org/show/738187/

After 19 seconds, that pool does not accept a single further object. We
simultaneously see slow request warnings creep up in the cluster, and
the only thing we can then do is kill the benchmark and wait for the
slow requests to clear out.

We've also seen the log messages discussed in
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028972.html,
and they seem to correlate with the slow requests popping up, but from
Greg's reply in
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-August/028974.html
I'm assuming that that's benign and doesn't warrant further investigation.

Here are a few things we've tried, to no avail:

- Make sure we use the latest Luminous release (we started out on Bionic
and 12.2.7, then reinstalled systems with Xenial so we could use 12.2.10).
- Enable Bluestore buffered writes (bluestore_default_buffered_write =
true); buffered reads are on by default.
- Extend the BlueStore cache from 1G to 4G (bluestore_cache_size_hdd =
4294967296; each OSD box has 128G of RAM, so we should not run into
memory starvation issues with that).
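
Concretely, that was something along these lines in ceph.conf on the
OSD nodes:

[osd]
bluestore_default_buffered_write = true
bluestore_cache_size_hdd = 4294967296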

But those were basically "let's give this a shot and see if it makes a
difference" attempts (it didn't).

I'm basically looking for ideas on where to even start looking, so if
anyone can point us in the right direction, that would be excellent.
Thanks in advance for any help you can offer; it is much appreciated!

Cheers,
Florian



_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
