Help needed for diagnosing slow_requests

Dear devs,

I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…



TL;DR: The cluster runs fine with kernel 4.13 but produces slow requests with kernel 4.15. How do I debug this?


I'm running a combined Ceph/KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
The main differences between the host types are the CPU generation (Westmere vs. Sandy Bridge) and the number of OSD disks.

The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu-based kernels and the Proxmox
virtualization framework. The Proxmox web UI also integrates some Ceph management.

On the Ceph side, 3 of the nodes run MGR, MON, and OSDs, while the other 3 run only OSDs.
The OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7, and all OSDs are BlueStore.


Now here's the thing:

A few weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I have been getting slow requests
that cause blocked I/O inside the VMs running on the cluster (though not necessarily on the host
with the OSD causing the slow request).

If I boot back into 4.13, Ceph runs smoothly again.


I'm asking for help debugging this issue because I'm running out of ideas about what else I could try.
So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose it. The output always indicates
that the primary OSD scheduled copies on two secondaries (e.g. on OSD 15: "event": "waiting for subops from 9,23")
but only one of them commits ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
no commit message from OSD 9).
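
For reference, these are the exact invocations for the example above (osd.15 as the primary,
osd.9 and osd.23 as the secondaries; each run on the host that carries the respective OSD):

    # on the primary's host: list currently blocked operations
    ceph daemon osd.15 dump_blocked_ops

    # the same check on the secondary that never sends its commit
    ceph daemon osd.9 dump_blocked_ops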

On OSD 9, however, there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot.
If this OSD never commits, shouldn't there be an operation on it that is failing to complete?
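
Should I also be looking at the in-flight or recently completed ops on the secondary? I.e., on osd.9's host:

    # all ops osd.9 is currently tracking, blocked or not
    ceph daemon osd.9 ops

    # recently completed ops with their per-event timestamps
    ceph daemon osd.9 dump_historic_ops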

Restarting the primary OSD that holds the blocked operation clears the error; restarting the secondary OSD
that does not commit has no effect on the issue.
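
In case the restart method matters: I restart via the standard systemd units, e.g. for the primary in
the example above:

    systemctl restart ceph-osd@15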


Any ideas on how to debug this further? What can I do to determine whether this is a Ceph issue
rather than a networking or kernel issue?
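
For what it's worth, these are the checks I'm considering next (the peer address is a placeholder):

    # verify that jumbo frames actually pass end-to-end on the 10GbE network
    # (8972 bytes payload = 9000 MTU minus 28 bytes of IP + ICMP headers)
    ping -M do -s 8972 <cluster IP of peer host>

    # temporarily raise OSD and messenger logging on a suspect OSD
    ceph tell osd.9 injectargs '--debug_osd 10 --debug_ms 1'

Would that produce useful data, or is there a better approach?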


I can provide more specific info if needed.


Thanks,

  Uwe



#### Hardware details ####
Host type 1:
  CPU: 2x Intel Xeon E5-2670
  RAM: 64GiB
  Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
  connected NICs: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)

Host type 2:
  CPU: 2x Intel Xeon E5606
  RAM: 96GiB
  Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
  connected NICs: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
#### End Hardware ####

#### Ceph OSD Tree ####
ID  CLASS WEIGHT   TYPE NAME                    STATUS REWEIGHT PRI-AFF
 -1       12.72653 root default
 -2        1.36418     host px-alpha-cluster
  0   hdd  0.22729         osd.0                    up  1.00000 1.00000
  1   hdd  0.22729         osd.1                    up  1.00000 1.00000
  2   hdd  0.90959         osd.2                    up  1.00000 1.00000
 -3        1.36418     host px-bravo-cluster
  3   hdd  0.22729         osd.3                    up  1.00000 1.00000
  4   hdd  0.22729         osd.4                    up  1.00000 1.00000
  5   hdd  0.90959         osd.5                    up  1.00000 1.00000
 -4        2.04648     host px-charlie-cluster
  6   hdd  0.90959         osd.6                    up  1.00000 1.00000
  7   hdd  0.22729         osd.7                    up  1.00000 1.00000
  8   hdd  0.90959         osd.8                    up  1.00000 1.00000
 -5        2.04648     host px-delta-cluster
  9   hdd  0.22729         osd.9                    up  1.00000 1.00000
 10   hdd  0.90959         osd.10                   up  1.00000 1.00000
 11   hdd  0.90959         osd.11                   up  1.00000 1.00000
-11        2.72516     host px-echo-cluster
 12   hdd  0.45419         osd.12                   up  1.00000 1.00000
 13   hdd  0.45419         osd.13                   up  1.00000 1.00000
 14   hdd  0.45419         osd.14                   up  1.00000 1.00000
 15   hdd  0.45419         osd.15                   up  1.00000 1.00000
 16   hdd  0.45419         osd.16                   up  1.00000 1.00000
 17   hdd  0.45419         osd.17                   up  1.00000 1.00000
-13        3.18005     host px-foxtrott-cluster
 18   hdd  0.45419         osd.18                   up  1.00000 1.00000
 19   hdd  0.45419         osd.19                   up  1.00000 1.00000
 20   hdd  0.45419         osd.20                   up  1.00000 1.00000
 21   hdd  0.90909         osd.21                   up  1.00000 1.00000
 22   hdd  0.45419         osd.22                   up  1.00000 1.00000
 23   hdd  0.45419         osd.23                   up  1.00000 1.00000
#### End OSD Tree ####

#### CRUSH map ####
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host px-alpha-cluster {
  id -2   # do not change unnecessarily
  id -6 class hdd   # do not change unnecessarily
  # weight 1.364
  alg straw
  hash 0  # rjenkins1
  item osd.0 weight 0.227
  item osd.1 weight 0.227
  item osd.2 weight 0.910
}
host px-bravo-cluster {
  id -3   # do not change unnecessarily
  id -7 class hdd   # do not change unnecessarily
  # weight 1.364
  alg straw
  hash 0  # rjenkins1
  item osd.3 weight 0.227
  item osd.4 weight 0.227
  item osd.5 weight 0.910
}
host px-charlie-cluster {
  id -4   # do not change unnecessarily
  id -8 class hdd   # do not change unnecessarily
  # weight 2.046
  alg straw
  hash 0  # rjenkins1
  item osd.7 weight 0.227
  item osd.8 weight 0.910
  item osd.6 weight 0.910
}
host px-delta-cluster {
  id -5   # do not change unnecessarily
  id -9 class hdd   # do not change unnecessarily
  # weight 2.046
  alg straw
  hash 0  # rjenkins1
  item osd.9 weight 0.227
  item osd.10 weight 0.910
  item osd.11 weight 0.910
}
host px-echo-cluster {
  id -11    # do not change unnecessarily
  id -12 class hdd    # do not change unnecessarily
  # weight 2.725
  alg straw2
  hash 0  # rjenkins1
  item osd.12 weight 0.454
  item osd.13 weight 0.454
  item osd.14 weight 0.454
  item osd.16 weight 0.454
  item osd.17 weight 0.454
  item osd.15 weight 0.454
}
host px-foxtrott-cluster {
  id -13    # do not change unnecessarily
  id -14 class hdd    # do not change unnecessarily
  # weight 3.180
  alg straw2
  hash 0  # rjenkins1
  item osd.18 weight 0.454
  item osd.19 weight 0.454
  item osd.20 weight 0.454
  item osd.22 weight 0.454
  item osd.23 weight 0.454
  item osd.21 weight 0.909
}
root default {
  id -1   # do not change unnecessarily
  id -10 class hdd    # do not change unnecessarily
  # weight 12.727
  alg straw
  hash 0  # rjenkins1
  item px-alpha-cluster weight 1.364
  item px-bravo-cluster weight 1.364
  item px-charlie-cluster weight 2.046
  item px-delta-cluster weight 2.046
  item px-echo-cluster weight 2.725
  item px-foxtrott-cluster weight 3.180
}

# rules
rule replicated_ruleset {
  id 0
  type replicated
  min_size 1
  max_size 10
  step take default
  step chooseleaf firstn 0 type host
  step emit
}

# end crush map
#### End CRUSH ####



