Good morning,

you might have seen my previous mails; I wanted to discuss some findings from the last day+night about what happened and why it happened here. As the system behaved inexplicably for us, we are now looking for someone to analyse the root cause on a consultancy basis - if you are interested & available, please drop me a PM.

What happened
-------------

Yesterday at about 1520 we moved 4 SSDs from server15 to server8. These SSDs are part of 3 SSD-only pools. Soon after that the cluster began to exhibit slow I/O on clients and we saw many PGs in the peering state. Additionally we began to see slow ops on unrelated OSDs, i.e. OSDs that have the device class "hdd-big" set and are only selected by the "hdd" pool. Excerpt from ceph -s:

    167 slow ops, oldest one blocked for 958 sec, daemons
    [osd.0,osd.11,osd.12,osd.14,osd.15,osd.16,osd.19,osd.2,osd.22,osd.25]...
    have slow ops.

All PGs that were hanging in the peering state seemed to be related to the pools backed by the moved SSDs. However, the question remains why other OSDs began to be affected.

Then we noticed that the newly started OSDs on server8 were consuming 100% CPU, the associated disks had a queue depth of 1 and the disks were 80-99% busy. This is a change we had already noticed some weeks ago in the Nautilus OSD behaviour: while in Luminous OSDs would start right away, in Nautilus OSDs often take seconds to minutes to start, accompanied by high CPU usage. Before yesterday, however, the maximum startup time was in the range of minutes.

When we noticed the blocked/slow IOPS, we wanted to ensure that the clients were affected as little as possible. We tried to submit these two changes to all OSDs:

    ceph tell osd.* injectargs '--osd-recovery-max-active 1'
    ceph tell osd.* injectargs '--osd-max-backfills 1'

However, at this point ceph tell would hang. We then switched to addressing each OSD individually using a for loop:

    for osd in $(ceph osd tree | grep osd. | awk '{ print $4 }'); do
        echo $osd
        ( ceph tell $osd injectargs '--osd-max-backfills 1' & )
    done

However, this resulted in many hanging ceph tell processes, and some of them began to report the following error:

    2020-09-22 15:59:25.407 7fbc0359e700  0 --1- [2a0a:e5c0:2:1:20d:b9ff:fe48:3bd4]:0/1280010984 >> v1:[2a0a:e5c0:2:1:21b:21ff:febb:68f0]:6856/4078 conn(0x7fbbec02edc0 0x7fbbec02f5d0 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=0 cs=0 l=1).handle_connect_reply_2 connect got BADAUTHORIZER

This message would be repeated many times per second per ceph tell process until we killed it with SIGINT.

After this failed, we noticed that the remaining HDDs in server15 were at 100% I/O utilisation with a read rate of only about 0.6-2 MB/s. These low rates would very well explain the slow I/O on the clients - but it made us wonder why the rates were suddenly so low. We restarted one of the OSDs on this server, but the effect was that the OSD now used 100% CPU in addition to fully utilising the disk.

We began to suspect a hardware problem and moved all HDDs to server8. Unfortunately, all the newly started HDD OSDs behaved exactly the same: very high CPU usage (60-100%), very high disk utilisation (about 90%+) and very low transfer rates (~1 MB/s on average). After a few hours (!) some of the OSDs normalised to 20-50 MB/s read rates, and after about 8 hours all of them had normalised. What is curious is that during the slow phase of many hours the average queue depth of the disks consistently stayed at 1, while afterwards it usually averages around 10-15. Given the full utilisation, we suspect that the Nautilus OSDs cause severe disk seeking.
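For reference, the queue depth, utilisation and transfer figures above are the ones iostat reports; something along these lines is enough to reproduce the observation (the device name is a placeholder):

    # extended device statistics, 5 second interval
    # aqu-sz/avgqu-sz = average queue depth, %util = share of time the
    # device was busy, rkB/s = read rate
    iostat -x 5 /dev/sdX

As an aside on the hanging ceph tell: a sketch of how the same two settings could also be applied through the local admin socket on each OSD host, which talks to the OSD over a unix socket and therefore does not depend on the messenger path that was returning BADAUTHORIZER (default socket paths assumed, untested in this incident):

    # run on every OSD host; only touches the OSDs running locally
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo "$sock"
        ceph daemon "$sock" config set osd_max_backfills 1
        ceph daemon "$sock" config set osd_recovery_max_active 1
    done

Bounding each call, e.g. timeout 30 ceph tell "$osd" injectargs '--osd-max-backfills 1', would at least have kept the loop from piling up hanging processes.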
Our expectation (and so far the behaviour with Luminous) was that if we have a single host failure, setting max backfills and max recovery to 1 and keeping osd_recovery_op_priority low (usually at 2 in our clusters) has some, but only a minor, impact on client I/O. This time, purely moving 4 OSDs with existing data made the cluster practically unusable, something we never experienced with a Luminous setup before. While I do not want to rule out potential mistakes or incorrect designs on our side, the effects we saw are not what we expected.

While all of this was happening we measured the network bandwidth on all nodes without finding any congestion, and we monitored the monitors, which showed some CPU usage (spikes up to 50% of one core), but nothing high.

I was wondering whether anyone on the list has an insight on these questions:

- Why were OSDs slow that are used in a different pool?
- What changed in the OSDs from Luminous to Nautilus to make the startup phase slow?
- What are Nautilus OSDs doing at startup and why do they consume a lot of CPU?
- What are the OSDs doing to severely impact the I/O? Is our seek theory correct?
- Why does / can ceph tell hang in Nautilus? We never experienced this problem in Luminous.
- Where does the BADAUTHORIZER message come from and what is the fix for it?
- How can we debug / ensure that ceph tell does not hang?
- Did code in the OSDs change that de-prioritises command traffic?

Best regards,

Nico

--
Modern, affordable, Swiss Virtual Machines. Visit www.datacenterlight.ch
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx