On Fri, 17 Aug 2018, Uwe Sauter wrote:
> On 17.08.18 at 14:23, Sage Weil wrote:
> > On Fri, 17 Aug 2018, Uwe Sauter wrote:
> >>
> >> Dear devs,
> >>
> >> I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…
> >>
> >>
> >> TL;DR: The cluster runs well with kernel 4.13 but produces slow requests with kernel 4.15. How do I debug this?
> >>
> >>
> >> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
> >> The main differences between those hosts are the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks.
> >>
> >> The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu kernels and the Proxmox
> >> virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.
> >>
> >> On the Ceph side, 3 of the nodes run MGR, MON and OSDs while the other 3 only run OSDs.
> >> OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7. All OSDs are BlueStore.
> >>
> >>
> >> Now here's the thing:
> >>
> >> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I have been getting slow requests that
> >> cause blocked I/O inside the VMs running on the cluster (but not necessarily on the host
> >> with the OSD causing the slow request).
> >>
> >> If I boot back into 4.13, Ceph runs smoothly again.
> >>
> >>
> >> I'm asking for help debugging this issue as I'm running out of ideas about what else I could do.
> >> So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose. It always indicates that the
> >> primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23")
> >> but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
> >> no commit message from OSD 9).
> >>
> >> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD
> >> does not commit, shouldn't there be an operation that does not succeed?
> >>
> >> Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that
> >> does not commit has no effect on the issue.
> >>
> >>
> >> Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not
> >> a networking or kernel issue?
> >
> > This kind of issue has usually turned out to be a networking issue in the
> > past (either kernel or hardware, or some combination of the two). I would
> > suggest adding debug_ms=1, reproducing, and seeing whether the replicated op
> > makes it to the blocked replica. It sounds like it isn't... in which case
> > cranking it up to debug_ms=20 and reproducing will show you more about
> > when ceph is reading data off the socket and when it isn't. And while it
> > is stuck you can identify the fd involved, check the socket status with
> > netstat, see if the 'data waiting' flag is set or not, and so on.
> >
> > But the times when we've gotten to that level it has (I think) always ended up
> > being either jumbo frame issues with the network hardware or problems
> > with, say, bonding. I'm not sure how the kernel version might have
> > affected the host's interaction with the network, but it seems
> > possible...
> >
> > sage
> >
>
> Sage,
>
> thanks for those suggestions. I'll try next week and get back to you. You are right about jumbo frames and bonding (which I forgot to
> mention).
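
On the jumbo frame / bonding point: one quick sanity check, before digging into Ceph itself, is to confirm that 9000-byte frames still make it between the cluster interfaces under 4.15, and that the bond looks the same as it did under 4.13. A rough sketch (interface and address names below are placeholders for your setup):

  # from one Ceph node to a peer, over the 10GbE / MTU 9000 network;
  # 8972 = 9000 bytes minus 28 bytes of IP + ICMP headers, with DF set
  ping -M do -s 8972 -c 3 <peer-cluster-ip>

  # MTU actually configured on the cluster-facing interface
  ip link show dev <cluster-interface>

  # if the 10GbE link is bonded, compare the bond state across kernel versions
  cat /proc/net/bonding/bond0

If the large pings fail under 4.15 but work under 4.13, that points at the network/driver side rather than at Ceph.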
>
> Just to make sure I understand correctly:
>
> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
> - And the effect is that there will be debug output in the log files? And even more when set to 20?

Right. Start with 1 and confirm that the message isn't arriving at the
replica but is sent on the primary... if so, then higher debug levels will
be needed. Level 20 will generate a *lot* of output and may make it hard
to reproduce.
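
In case it saves a round trip, a minimal sketch of both ways to apply it (osd.9 is just the replica from your dump_blocked_ops example; the path assumes the default log location):

  # persistent, in ceph.conf on the OSD nodes (takes effect on OSD restart):
  [osd]
      debug ms = 1

  # or injected at runtime, without restarting anything:
  ceph tell osd.* injectargs '--debug_ms 1'
  # ...or for a single daemon, on the node hosting it:
  ceph daemon osd.9 config set debug_ms 1

  # the output goes to the normal OSD logs:
  /var/log/ceph/ceph-osd.9.log

And while an op is stuck, "ss -tn" on the replica's node shows Recv-Q per socket; a Recv-Q that stays non-zero on the connection to the primary would mean data is sitting in the kernel without ceph-osd reading it.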
s

>
> Have a nice weekend,
>
> Uwe
>
> >>
> >> I can provide more specific info if needed.
> >>
> >>
> >> Thanks,
> >>
> >> Uwe
> >>
> >>
> >>
> >> #### Hardware details ####
> >> Host type 1:
> >> CPU: 2x Intel Xeon E5-2670
> >> RAM: 64GiB
> >> Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
> >> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> >>
> >> Host type 2:
> >> CPU: 2x Intel Xeon E5606
> >> RAM: 96GiB
> >> Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
> >> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> >> #### End Hardware ####
> >>
> >> #### Ceph OSD Tree ####
> >> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
> >> -1        12.72653 root default
> >> -2         1.36418     host px-alpha-cluster
> >>  0   hdd   0.22729         osd.0                     up  1.00000 1.00000
> >>  1   hdd   0.22729         osd.1                     up  1.00000 1.00000
> >>  2   hdd   0.90959         osd.2                     up  1.00000 1.00000
> >> -3         1.36418     host px-bravo-cluster
> >>  3   hdd   0.22729         osd.3                     up  1.00000 1.00000
> >>  4   hdd   0.22729         osd.4                     up  1.00000 1.00000
> >>  5   hdd   0.90959         osd.5                     up  1.00000 1.00000
> >> -4         2.04648     host px-charlie-cluster
> >>  6   hdd   0.90959         osd.6                     up  1.00000 1.00000
> >>  7   hdd   0.22729         osd.7                     up  1.00000 1.00000
> >>  8   hdd   0.90959         osd.8                     up  1.00000 1.00000
> >> -5         2.04648     host px-delta-cluster
> >>  9   hdd   0.22729         osd.9                     up  1.00000 1.00000
> >> 10   hdd   0.90959         osd.10                    up  1.00000 1.00000
> >> 11   hdd   0.90959         osd.11                    up  1.00000 1.00000
> >> -11        2.72516     host px-echo-cluster
> >> 12   hdd   0.45419         osd.12                    up  1.00000 1.00000
> >> 13   hdd   0.45419         osd.13                    up  1.00000 1.00000
> >> 14   hdd   0.45419         osd.14                    up  1.00000 1.00000
> >> 15   hdd   0.45419         osd.15                    up  1.00000 1.00000
> >> 16   hdd   0.45419         osd.16                    up  1.00000 1.00000
> >> 17   hdd   0.45419         osd.17                    up  1.00000 1.00000
> >> -13        3.18005     host px-foxtrott-cluster
> >> 18   hdd   0.45419         osd.18                    up  1.00000 1.00000
> >> 19   hdd   0.45419         osd.19                    up  1.00000 1.00000
> >> 20   hdd   0.45419         osd.20                    up  1.00000 1.00000
> >> 21   hdd   0.90909         osd.21                    up  1.00000 1.00000
> >> 22   hdd   0.45419         osd.22                    up  1.00000 1.00000
> >> 23   hdd   0.45419         osd.23                    up  1.00000 1.00000
> >> #### End OSD Tree ####
> >>
> >> #### CRUSH map ####
> >> # begin crush map
> >> tunable choose_local_tries 0
> >> tunable choose_local_fallback_tries 0
> >> tunable choose_total_tries 50
> >> tunable chooseleaf_descend_once 1
> >> tunable chooseleaf_vary_r 1
> >> tunable chooseleaf_stable 1
> >> tunable straw_calc_version 1
> >> tunable allowed_bucket_algs 54
> >>
> >> # devices
> >> device 0 osd.0 class hdd
> >> device 1 osd.1 class hdd
> >> device 2 osd.2 class hdd
> >> device 3 osd.3 class hdd
> >> device 4 osd.4 class hdd
> >> device 5 osd.5 class hdd
> >> device 6 osd.6 class hdd
> >> device 7 osd.7 class hdd
> >> device 8 osd.8 class hdd
> >> device 9 osd.9 class hdd
> >> device 10 osd.10 class hdd
> >> device 11 osd.11 class hdd
> >> device 12 osd.12 class hdd
> >> device 13 osd.13 class hdd
> >> device 14 osd.14 class hdd
> >> device 15 osd.15 class hdd
> >> device 16 osd.16 class hdd
> >> device 17 osd.17 class hdd
> >> device 18 osd.18 class hdd
> >> device 19 osd.19 class hdd
> >> device 20 osd.20 class hdd
> >> device 21 osd.21 class hdd
> >> device 22 osd.22 class hdd
> >> device 23 osd.23 class hdd
> >>
> >> # types
> >> type 0 osd
> >> type 1 host
> >> type 2 chassis
> >> type 3 rack
> >> type 4 row
> >> type 5 pdu
> >> type 6 pod
> >> type 7 room
> >> type 8 datacenter
> >> type 9 region
> >> type 10 root
> >>
> >> # buckets
> >> host px-alpha-cluster {
> >> 	id -2		# do not change unnecessarily
> >> 	id -6 class hdd		# do not change unnecessarily
> >> 	# weight 1.364
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.0 weight 0.227
> >> 	item osd.1 weight 0.227
> >> 	item osd.2 weight 0.910
> >> }
> >> host px-bravo-cluster {
> >> 	id -3		# do not change unnecessarily
> >> 	id -7 class hdd		# do not change unnecessarily
> >> 	# weight 1.364
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.3 weight 0.227
> >> 	item osd.4 weight 0.227
> >> 	item osd.5 weight 0.910
> >> }
> >> host px-charlie-cluster {
> >> 	id -4		# do not change unnecessarily
> >> 	id -8 class hdd		# do not change unnecessarily
> >> 	# weight 2.046
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.7 weight 0.227
> >> 	item osd.8 weight 0.910
> >> 	item osd.6 weight 0.910
> >> }
> >> host px-delta-cluster {
> >> 	id -5		# do not change unnecessarily
> >> 	id -9 class hdd		# do not change unnecessarily
> >> 	# weight 2.046
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.9 weight 0.227
> >> 	item osd.10 weight 0.910
> >> 	item osd.11 weight 0.910
> >> }
> >> host px-echo-cluster {
> >> 	id -11		# do not change unnecessarily
> >> 	id -12 class hdd		# do not change unnecessarily
> >> 	# weight 2.725
> >> 	alg straw2
> >> 	hash 0	# rjenkins1
> >> 	item osd.12 weight 0.454
> >> 	item osd.13 weight 0.454
> >> 	item osd.14 weight 0.454
> >> 	item osd.16 weight 0.454
> >> 	item osd.17 weight 0.454
> >> 	item osd.15 weight 0.454
> >> }
> >> host px-foxtrott-cluster {
> >> 	id -13		# do not change unnecessarily
> >> 	id -14 class hdd		# do not change unnecessarily
> >> 	# weight 3.180
> >> 	alg straw2
> >> 	hash 0	# rjenkins1
> >> 	item osd.18 weight 0.454
> >> 	item osd.19 weight 0.454
> >> 	item osd.20 weight 0.454
> >> 	item osd.22 weight 0.454
> >> 	item osd.23 weight 0.454
> >> 	item osd.21 weight 0.909
> >> }
> >> root default {
> >> 	id -1		# do not change unnecessarily
> >> 	id -10 class hdd		# do not change unnecessarily
> >> 	# weight 12.727
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item px-alpha-cluster weight 1.364
> >> 	item px-bravo-cluster weight 1.364
> >> 	item px-charlie-cluster weight 2.046
> >> 	item px-delta-cluster weight 2.046
> >> 	item px-echo-cluster weight 2.725
> >> 	item px-foxtrott-cluster weight 3.180
> >> }
> >>
> >> # rules
> >> rule replicated_ruleset {
> >> 	id 0
> >> 	type replicated
> >> 	min_size 1
> >> 	max_size 10
> >> 	step take default
> >> 	step chooseleaf firstn 0 type host
> >> 	step emit
> >> }
> >>
> >> # end crush map
> >> #### End CRUSH ####
> >>
> >
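
P.S. Coming back to the dump_blocked_ops output quoted at the top: when the replica reports "num_blocked_ops": 0, it can still be worth checking whether the sub-op ever reached it at all. A small sketch, using the OSD ids from your example (osd.15 as primary, osd.9 as the replica that never commits); each command runs on the node hosting that OSD:

  # on the primary's node: which ops are blocked, and at which event
  ceph daemon osd.15 dump_blocked_ops

  # on the replica's node: is the sub-op in flight at all, or already completed?
  ceph daemon osd.9 dump_ops_in_flight
  ceph daemon osd.9 dump_historic_ops

If the sub-op never shows up on osd.9 at all, that points at the wire (and back to the debug_ms / netstat route above) rather than at an OSD that is stuck processing it.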