On Fri, 17 Aug 2018, Uwe Sauter wrote:
> On 17.08.18 at 14:23, Sage Weil wrote:
> > On Fri, 17 Aug 2018, Uwe Sauter wrote:
> >>
> >> Dear devs,
> >>
> >> I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…
> >>
> >>
> >> TL;DR: The cluster runs well with kernel 4.13 but produces slow requests with kernel 4.15. How do I debug this?
> >>
> >>
> >> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
> >> The main differences between those hosts are the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks.
> >>
> >> The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu kernels and the Proxmox
> >> virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.
> >>
> >> On the Ceph side, 3 of the nodes run MGR, MON and OSDs while the other 3 only run OSDs.
> >> OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7. All OSDs are BlueStore.
> >>
> >>
> >> Now here's the thing:
> >>
> >> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I have been getting slow requests that
> >> cause blocked I/O inside the VMs running on the cluster (but not necessarily on the host
> >> with the OSD causing the slow request).
> >>
> >> If I boot back into 4.13, Ceph runs smoothly again.
> >>
> >>
> >> I'm asking for help debugging this issue as I'm running out of ideas about what else I could do.
> >> So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose. It always indicates that the
> >> primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23")
> >> but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
> >> no commit message from OSD 9).
> >>
> >> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD
> >> does not commit, shouldn't there be an operation that does not succeed?
> >>
> >> Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that
> >> does not commit has no effect on the issue.
> >>
> >>
> >> Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not
> >> a networking or kernel issue?
> >
> > This kind of issue has usually turned out to be a networking issue in the
> > past (either kernel or hardware, or some combination of the two). I would
> > suggest adding debug_ms=1, reproducing, and seeing whether the replicated op
> > makes it to the blocked replica. It sounds like it isn't... in which case
> > cranking it up to debug_ms=20 and reproducing will show you more about
> > when ceph is reading data off the socket and when it isn't. And while it
> > is stuck you can identify the fd involved, check the socket status with
> > netstat, see if the 'data waiting' flag is set or not, and so on.
> >
> > But the times when we've gotten to that level it has (I think) always ended up
> > being either jumbo frame issues with the network hardware or problems
> > with, say, bonding. I'm not sure how the kernel version might have
> > affected the host's interaction with the network, but it seems
> > possible...
> >
> > sage
> >
>
> Sage,
>
> thanks for those suggestions. I'll try next week and get back to you. You are right about jumbo frames and bonding (which I forgot to
> mention).
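
On the jumbo frame / bonding point: one quick sanity check, before digging into Ceph itself, is to confirm that 9000-byte frames still make it between the cluster interfaces under 4.15, and that the bond looks the same as it did under 4.13. A rough sketch (interface and address names below are placeholders for your setup):

  # from one Ceph node to a peer, over the 10GbE / MTU 9000 network;
  # 8972 = 9000 bytes minus 28 bytes of IP + ICMP headers, with DF set
  ping -M do -s 8972 -c 3 <peer-cluster-ip>

  # MTU actually configured on the cluster-facing interface
  ip link show dev <cluster-interface>

  # if the 10GbE link is bonded, compare the bond state across kernel versions
  cat /proc/net/bonding/bond0

If the large pings fail under 4.15 but work under 4.13, that points at the network/driver side rather than at Ceph.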
>
> Just to make sure I understand correctly:
>
> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
> - And the effect is that there will be debug output in the log files? And even more when set to 20?

Right. Start with 1 and confirm that the message isn't arriving at the
replica but is sent on the primary... if so, then higher debug levels will
be needed. Level 20 will generate a *lot* of output and may make it hard
to reproduce.
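
In case it saves a round trip, a minimal sketch of both ways to apply it (osd.9 is just the replica from your dump_blocked_ops example; the path assumes the default log location):

  # persistent, in ceph.conf on the OSD nodes (takes effect on OSD restart):
  [osd]
      debug ms = 1

  # or injected at runtime, without restarting anything:
  ceph tell osd.* injectargs '--debug_ms 1'
  # ...or for a single daemon, on the node hosting it:
  ceph daemon osd.9 config set debug_ms 1

  # the output goes to the normal OSD logs:
  /var/log/ceph/ceph-osd.9.log

And while an op is stuck, "ss -tn" on the replica's node shows Recv-Q per socket; a Recv-Q that stays non-zero on the connection to the primary would mean data is sitting in the kernel without ceph-osd reading it.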
s

>
> Have a nice weekend,
>
> Uwe
>
> >>
> >> I can provide more specific info if needed.
> >>
> >>
> >> Thanks,
> >>
> >> Uwe
> >>
> >>
> >>
> >> #### Hardware details ####
> >> Host type 1:
> >> CPU: 2x Intel Xeon E5-2670
> >> RAM: 64GiB
> >> Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
> >> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> >>
> >> Host type 2:
> >> CPU: 2x Intel Xeon E5606
> >> RAM: 96GiB
> >> Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
> >> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> >> #### End Hardware ####
> >>
> >> #### Ceph OSD Tree ####
> >> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
> >> -1        12.72653 root default
> >> -2         1.36418     host px-alpha-cluster
> >>  0   hdd   0.22729         osd.0                     up  1.00000 1.00000
> >>  1   hdd   0.22729         osd.1                     up  1.00000 1.00000
> >>  2   hdd   0.90959         osd.2                     up  1.00000 1.00000
> >> -3         1.36418     host px-bravo-cluster
> >>  3   hdd   0.22729         osd.3                     up  1.00000 1.00000
> >>  4   hdd   0.22729         osd.4                     up  1.00000 1.00000
> >>  5   hdd   0.90959         osd.5                     up  1.00000 1.00000
> >> -4         2.04648     host px-charlie-cluster
> >>  6   hdd   0.90959         osd.6                     up  1.00000 1.00000
> >>  7   hdd   0.22729         osd.7                     up  1.00000 1.00000
> >>  8   hdd   0.90959         osd.8                     up  1.00000 1.00000
> >> -5         2.04648     host px-delta-cluster
> >>  9   hdd   0.22729         osd.9                     up  1.00000 1.00000
> >> 10   hdd   0.90959         osd.10                    up  1.00000 1.00000
> >> 11   hdd   0.90959         osd.11                    up  1.00000 1.00000
> >> -11        2.72516     host px-echo-cluster
> >> 12   hdd   0.45419         osd.12                    up  1.00000 1.00000
> >> 13   hdd   0.45419         osd.13                    up  1.00000 1.00000
> >> 14   hdd   0.45419         osd.14                    up  1.00000 1.00000
> >> 15   hdd   0.45419         osd.15                    up  1.00000 1.00000
> >> 16   hdd   0.45419         osd.16                    up  1.00000 1.00000
> >> 17   hdd   0.45419         osd.17                    up  1.00000 1.00000
> >> -13        3.18005     host px-foxtrott-cluster
> >> 18   hdd   0.45419         osd.18                    up  1.00000 1.00000
> >> 19   hdd   0.45419         osd.19                    up  1.00000 1.00000
> >> 20   hdd   0.45419         osd.20                    up  1.00000 1.00000
> >> 21   hdd   0.90909         osd.21                    up  1.00000 1.00000
> >> 22   hdd   0.45419         osd.22                    up  1.00000 1.00000
> >> 23   hdd   0.45419         osd.23                    up  1.00000 1.00000
> >> #### End OSD Tree ####
> >>
> >> #### CRUSH map ####
> >> # begin crush map
> >> tunable choose_local_tries 0
> >> tunable choose_local_fallback_tries 0
> >> tunable choose_total_tries 50
> >> tunable chooseleaf_descend_once 1
> >> tunable chooseleaf_vary_r 1
> >> tunable chooseleaf_stable 1
> >> tunable straw_calc_version 1
> >> tunable allowed_bucket_algs 54
> >>
> >> # devices
> >> device 0 osd.0 class hdd
> >> device 1 osd.1 class hdd
> >> device 2 osd.2 class hdd
> >> device 3 osd.3 class hdd
> >> device 4 osd.4 class hdd
> >> device 5 osd.5 class hdd
> >> device 6 osd.6 class hdd
> >> device 7 osd.7 class hdd
> >> device 8 osd.8 class hdd
> >> device 9 osd.9 class hdd
> >> device 10 osd.10 class hdd
> >> device 11 osd.11 class hdd
> >> device 12 osd.12 class hdd
> >> device 13 osd.13 class hdd
> >> device 14 osd.14 class hdd
> >> device 15 osd.15 class hdd
> >> device 16 osd.16 class hdd
> >> device 17 osd.17 class hdd
> >> device 18 osd.18 class hdd
> >> device 19 osd.19 class hdd
> >> device 20 osd.20 class hdd
> >> device 21 osd.21 class hdd
> >> device 22 osd.22 class hdd
> >> device 23 osd.23 class hdd
> >>
> >> # types
> >> type 0 osd
> >> type 1 host
> >> type 2 chassis
> >> type 3 rack
> >> type 4 row
> >> type 5 pdu
> >> type 6 pod
> >> type 7 room
> >> type 8 datacenter
> >> type 9 region
> >> type 10 root
> >>
> >> # buckets
> >> host px-alpha-cluster {
> >> 	id -2		# do not change unnecessarily
> >> 	id -6 class hdd		# do not change unnecessarily
> >> 	# weight 1.364
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.0 weight 0.227
> >> 	item osd.1 weight 0.227
> >> 	item osd.2 weight 0.910
> >> }
> >> host px-bravo-cluster {
> >> 	id -3		# do not change unnecessarily
> >> 	id -7 class hdd		# do not change unnecessarily
> >> 	# weight 1.364
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.3 weight 0.227
> >> 	item osd.4 weight 0.227
> >> 	item osd.5 weight 0.910
> >> }
> >> host px-charlie-cluster {
> >> 	id -4		# do not change unnecessarily
> >> 	id -8 class hdd		# do not change unnecessarily
> >> 	# weight 2.046
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.7 weight 0.227
> >> 	item osd.8 weight 0.910
> >> 	item osd.6 weight 0.910
> >> }
> >> host px-delta-cluster {
> >> 	id -5		# do not change unnecessarily
> >> 	id -9 class hdd		# do not change unnecessarily
> >> 	# weight 2.046
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item osd.9 weight 0.227
> >> 	item osd.10 weight 0.910
> >> 	item osd.11 weight 0.910
> >> }
> >> host px-echo-cluster {
> >> 	id -11		# do not change unnecessarily
> >> 	id -12 class hdd		# do not change unnecessarily
> >> 	# weight 2.725
> >> 	alg straw2
> >> 	hash 0	# rjenkins1
> >> 	item osd.12 weight 0.454
> >> 	item osd.13 weight 0.454
> >> 	item osd.14 weight 0.454
> >> 	item osd.16 weight 0.454
> >> 	item osd.17 weight 0.454
> >> 	item osd.15 weight 0.454
> >> }
> >> host px-foxtrott-cluster {
> >> 	id -13		# do not change unnecessarily
> >> 	id -14 class hdd		# do not change unnecessarily
> >> 	# weight 3.180
> >> 	alg straw2
> >> 	hash 0	# rjenkins1
> >> 	item osd.18 weight 0.454
> >> 	item osd.19 weight 0.454
> >> 	item osd.20 weight 0.454
> >> 	item osd.22 weight 0.454
> >> 	item osd.23 weight 0.454
> >> 	item osd.21 weight 0.909
> >> }
> >> root default {
> >> 	id -1		# do not change unnecessarily
> >> 	id -10 class hdd		# do not change unnecessarily
> >> 	# weight 12.727
> >> 	alg straw
> >> 	hash 0	# rjenkins1
> >> 	item px-alpha-cluster weight 1.364
> >> 	item px-bravo-cluster weight 1.364
> >> 	item px-charlie-cluster weight 2.046
> >> 	item px-delta-cluster weight 2.046
> >> 	item px-echo-cluster weight 2.725
> >> 	item px-foxtrott-cluster weight 3.180
> >> }
> >>
> >> # rules
> >> rule replicated_ruleset {
> >> 	id 0
> >> 	type replicated
> >> 	min_size 1
> >> 	max_size 10
> >> 	step take default
> >> 	step chooseleaf firstn 0 type host
> >> 	step emit
> >> }
> >>
> >> # end crush map
> >> #### End CRUSH ####
> >>
> >
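
P.S. Coming back to the dump_blocked_ops output quoted at the top: when the replica reports "num_blocked_ops": 0, it can still be worth checking whether the sub-op ever reached it at all. A small sketch, using the OSD ids from your example (osd.15 as primary, osd.9 as the replica that never commits); each command runs on the node hosting that OSD:

  # on the primary's node: which ops are blocked, and at which event
  ceph daemon osd.15 dump_blocked_ops

  # on the replica's node: is the sub-op in flight at all, or already completed?
  ceph daemon osd.9 dump_ops_in_flight
  ceph daemon osd.9 dump_historic_ops

If the sub-op never shows up on osd.9 at all, that points at the wire (and back to the debug_ms / netstat route above) rather than at an OSD that is stuck processing it.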