On Fri, 17 Aug 2018, Uwe Sauter wrote:
> Dear devs,
>
> I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…
>
> TL;DR: The cluster runs well with kernel 4.13 but produces slow requests with kernel 4.15. How do I debug this?
>
> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
> The main differences between those hosts are CPU generation (Westmere / Sandy Bridge) and the number of OSD disks.
>
> The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu kernels and the Proxmox
> virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.
>
> On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the other 3 only run OSDs.
> OSD tree and CRUSH map are at the end. Ceph version is 12.2.7. All OSDs are BlueStore.
>
> Now here's the thing:
>
> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that
> cause blocked I/O inside the VMs running on the cluster (though not necessarily on the host
> with the OSD causing the slow request).
>
> If I boot back into 4.13, Ceph runs smoothly again.
>
> I'm seeking help to debug this issue as I'm running out of ideas what else I could do.
> So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose it, which always indicates that the
> primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23")
> but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
> no commit message from OSD 9).
>
> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD
> does not commit, there should be an operation that does not succeed, should there not?
>
> Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that
> does not commit has no effect on the issue.
>
> Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not
> a networking or kernel issue?

This kind of issue has usually turned out to be a networking issue in the
past (either kernel or hardware, or some combination of the two).

I would suggest adding debug_ms=1, reproducing, and seeing whether the
replicated op makes it to the blocked replica. It sounds like it isn't... in
which case cranking it up to debug_ms=20 and reproducing will show you more
about when Ceph is reading data off the socket and when it isn't. And while
an op is stuck you can identify the fd involved, check the socket status with
netstat, see whether data is waiting to be read, and so on. (A rough sketch
of these commands is appended at the end of this mail.)

The times we've gotten to that level it has (I think) always ended up being
either jumbo frame issues with the network hardware or problems with, say,
bonding. I'm not sure how the kernel version might affect the host's
interaction with the network, but it seems possible...

sage

> I can provide more specific info if needed.
>
> Thanks,
>
> Uwe
>
>
> #### Hardware details ####
> Host type 1:
> CPU:           2x Intel Xeon E5-2670
> RAM:           64GiB
> Storage:       1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>
> Host type 2:
> CPU:           2x Intel Xeon E5606
> RAM:           96GiB
> Storage:       1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> #### End Hardware ####
>
> #### Ceph OSD Tree ####
> ID  CLASS WEIGHT   TYPE NAME                    STATUS REWEIGHT PRI-AFF
>  -1       12.72653 root default
>  -2        1.36418     host px-alpha-cluster
>   0   hdd  0.22729         osd.0                    up  1.00000 1.00000
>   1   hdd  0.22729         osd.1                    up  1.00000 1.00000
>   2   hdd  0.90959         osd.2                    up  1.00000 1.00000
>  -3        1.36418     host px-bravo-cluster
>   3   hdd  0.22729         osd.3                    up  1.00000 1.00000
>   4   hdd  0.22729         osd.4                    up  1.00000 1.00000
>   5   hdd  0.90959         osd.5                    up  1.00000 1.00000
>  -4        2.04648     host px-charlie-cluster
>   6   hdd  0.90959         osd.6                    up  1.00000 1.00000
>   7   hdd  0.22729         osd.7                    up  1.00000 1.00000
>   8   hdd  0.90959         osd.8                    up  1.00000 1.00000
>  -5        2.04648     host px-delta-cluster
>   9   hdd  0.22729         osd.9                    up  1.00000 1.00000
>  10   hdd  0.90959         osd.10                   up  1.00000 1.00000
>  11   hdd  0.90959         osd.11                   up  1.00000 1.00000
> -11        2.72516     host px-echo-cluster
>  12   hdd  0.45419         osd.12                   up  1.00000 1.00000
>  13   hdd  0.45419         osd.13                   up  1.00000 1.00000
>  14   hdd  0.45419         osd.14                   up  1.00000 1.00000
>  15   hdd  0.45419         osd.15                   up  1.00000 1.00000
>  16   hdd  0.45419         osd.16                   up  1.00000 1.00000
>  17   hdd  0.45419         osd.17                   up  1.00000 1.00000
> -13        3.18005     host px-foxtrott-cluster
>  18   hdd  0.45419         osd.18                   up  1.00000 1.00000
>  19   hdd  0.45419         osd.19                   up  1.00000 1.00000
>  20   hdd  0.45419         osd.20                   up  1.00000 1.00000
>  21   hdd  0.90909         osd.21                   up  1.00000 1.00000
>  22   hdd  0.45419         osd.22                   up  1.00000 1.00000
>  23   hdd  0.45419         osd.23                   up  1.00000 1.00000
> #### End OSD Tree ####
>
> #### CRUSH map ####
> # begin crush map
> tunable choose_local_tries 0
> tunable choose_local_fallback_tries 0
> tunable choose_total_tries 50
> tunable chooseleaf_descend_once 1
> tunable chooseleaf_vary_r 1
> tunable chooseleaf_stable 1
> tunable straw_calc_version 1
> tunable allowed_bucket_algs 54
>
> # devices
> device 0 osd.0 class hdd
> device 1 osd.1 class hdd
> device 2 osd.2 class hdd
> device 3 osd.3 class hdd
> device 4 osd.4 class hdd
> device 5 osd.5 class hdd
> device 6 osd.6 class hdd
> device 7 osd.7 class hdd
> device 8 osd.8 class hdd
> device 9 osd.9 class hdd
> device 10 osd.10 class hdd
> device 11 osd.11 class hdd
> device 12 osd.12 class hdd
> device 13 osd.13 class hdd
> device 14 osd.14 class hdd
> device 15 osd.15 class hdd
> device 16 osd.16 class hdd
> device 17 osd.17 class hdd
> device 18 osd.18 class hdd
> device 19 osd.19 class hdd
> device 20 osd.20 class hdd
> device 21 osd.21 class hdd
> device 22 osd.22 class hdd
> device 23 osd.23 class hdd
>
> # types
> type 0 osd
> type 1 host
> type 2 chassis
> type 3 rack
> type 4 row
> type 5 pdu
> type 6 pod
> type 7 room
> type 8 datacenter
> type 9 region
> type 10 root
>
> # buckets
> host px-alpha-cluster {
>     id -2           # do not change unnecessarily
>     id -6 class hdd # do not change unnecessarily
>     # weight 1.364
>     alg straw
>     hash 0  # rjenkins1
>     item osd.0 weight 0.227
>     item osd.1 weight 0.227
>     item osd.2 weight 0.910
> }
> host px-bravo-cluster {
>     id -3           # do not change unnecessarily
>     id -7 class hdd # do not change unnecessarily
>     # weight 1.364
>     alg straw
>     hash 0  # rjenkins1
>     item osd.3 weight 0.227
>     item osd.4 weight 0.227
>     item osd.5 weight 0.910
> }
> host px-charlie-cluster {
>     id -4           # do not change unnecessarily
>     id -8 class hdd # do not change unnecessarily
>     # weight 2.046
>     alg straw
>     hash 0  # rjenkins1
>     item osd.7 weight 0.227
>     item osd.8 weight 0.910
>     item osd.6 weight 0.910
> }
> host px-delta-cluster {
>     id -5           # do not change unnecessarily
>     id -9 class hdd # do not change unnecessarily
>     # weight 2.046
>     alg straw
>     hash 0  # rjenkins1
>     item osd.9 weight 0.227
>     item osd.10 weight 0.910
>     item osd.11 weight 0.910
> }
> host px-echo-cluster {
>     id -11           # do not change unnecessarily
>     id -12 class hdd # do not change unnecessarily
>     # weight 2.725
>     alg straw2
>     hash 0  # rjenkins1
>     item osd.12 weight 0.454
>     item osd.13 weight 0.454
>     item osd.14 weight 0.454
>     item osd.16 weight 0.454
>     item osd.17 weight 0.454
>     item osd.15 weight 0.454
> }
> host px-foxtrott-cluster {
>     id -13           # do not change unnecessarily
>     id -14 class hdd # do not change unnecessarily
>     # weight 3.180
>     alg straw2
>     hash 0  # rjenkins1
>     item osd.18 weight 0.454
>     item osd.19 weight 0.454
>     item osd.20 weight 0.454
>     item osd.22 weight 0.454
>     item osd.23 weight 0.454
>     item osd.21 weight 0.909
> }
> root default {
>     id -1            # do not change unnecessarily
>     id -10 class hdd # do not change unnecessarily
>     # weight 12.727
>     alg straw
>     hash 0  # rjenkins1
>     item px-alpha-cluster weight 1.364
>     item px-bravo-cluster weight 1.364
>     item px-charlie-cluster weight 2.046
>     item px-delta-cluster weight 2.046
>     item px-echo-cluster weight 2.725
>     item px-foxtrott-cluster weight 3.180
> }
>
> # rules
> rule replicated_ruleset {
>     id 0
>     type replicated
>     min_size 1
>     max_size 10
>     step take default
>     step chooseleaf firstn 0 type host
>     step emit
> }
>
> # end crush map
> #### End CRUSH ####
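
Appendix: a minimal sketch of the debugging steps suggested above, using only
standard tools. The OSD ids (osd.15 as the primary holding the blocked op,
osd.9 as the replica that never commits) are taken from the example earlier
in this mail, and the peer address is a placeholder; substitute whatever
dump_blocked_ops actually reports on your cluster.

    # Bump messenger logging on the primary and on the silent replica,
    # reproduce the slow request, then check whether the replicated op
    # (osd_repop) ever shows up in the replica's log:
    ceph tell osd.15 injectargs '--debug_ms 1'
    ceph tell osd.9 injectargs '--debug_ms 1'
    grep osd_repop /var/log/ceph/ceph-osd.9.log

    # If it never arrives, crank the replica up further to see when the
    # messenger is (or is not) reading data off the socket:
    ceph tell osd.9 injectargs '--debug_ms 20'

    # While an op is stuck, inspect the TCP connections of the OSD processes.
    # A growing Recv-Q on the replica side means data is sitting in the
    # socket that the OSD is not reading.
    ss -tnp | grep ceph-osd        # or: netstat -tnp | grep ceph-osd
    # Map a suspicious connection to the fd inside the OSD process:
    lsof -nP -c ceph-osd | grep TCP

    # Since the 10GbE links run MTU 9000, also rule out jumbo frame problems
    # by sending a maximum-size, unfragmentable ping between the Ceph hosts
    # (9000 byte MTU - 28 bytes of IP/ICMP headers = 8972 byte payload):
    ping -M do -s 8972 <other-host-ceph-ip>

    # When done, drop logging back to the default:
    ceph tell osd.15 injectargs '--debug_ms 0/5'
    ceph tell osd.9 injectargs '--debug_ms 0/5'

Note that debug_ms=20 is extremely verbose, so only leave it raised while
reproducing the problem and revert it afterwards.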