Hey Sage, I just saw your comment that a jumbo frame misconfiguration might cause this kind of slow request. We hit the same kind of issue a few weeks ago, but in our case there were also OSD op thread timeouts that even reached the suicide timeout. I traced the OSD log at debug level 20, and from what I can see the timed-out thread simply does not get to run during that window, which leaves the corresponding slow request stuck at the "queued_for_pg" event. It is very strange to me why a misconfigured MTU on the network side would cause the OSD op thread to time out?

Thanks & Regards,
Dongdong

On 17 Aug 2018, at 20:29, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:

On 17.08.18 at 14:23, Sage Weil wrote:

On Fri, 17 Aug 2018, Uwe Sauter wrote:

Dear devs,

I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…

TL;DR: The cluster runs fine with kernel 4.13 but produces slow requests with kernel 4.15. How do I debug this?

I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end). The main difference between those hosts is the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks. The cluster runs Proxmox 5.2, which is essentially Debian 9 but using Ubuntu kernels and the Proxmox virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.

On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the other 3 only run OSDs. OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7. All OSDs are BlueStore.

Now here's the thing: some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that cause blocked IO inside the VMs running on the cluster (but not necessarily on the host with the OSD causing the slow request). If I boot back into 4.13, Ceph runs smoothly again.

I'm asking for help to debug this issue as I'm running out of ideas about what else I could do.

So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose, which always indicates that the primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23") but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is no commit message from OSD 9). On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD does not commit, there should be an operation that does not succeed, should it not?

Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that does not commit has no effect on the issue.

Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not a networking or kernel issue?

This kind of issue has usually turned out to be a networking issue in the past (either kernel or hardware, or some combination of the two).

I would suggest adding debug_ms=1, reproducing, and seeing whether the replicated op makes it to the blocked replica. It sounds like it isn't… in which case cranking it up to debug_ms=20 and reproducing will show you more about when Ceph is reading data off the socket and when it isn't. And while it is stuck you can identify the fd involved, check the socket status with netstat, see if the 'data waiting' flag is set or not, and so on. But the times when we've gotten to that level it has (I think) always ended up being either jumbo frame issues with the network hardware or problems with, say, bonding.
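Concretely, something along these lines; a rough sketch only, where osd.15 is taken from the dump_blocked_ops example above and the pid placeholder needs to be adapted to your ceph-osd process:

  # bump messenger logging on the primary with the blocked op at runtime
  # (or put "debug ms = 1" under [osd] in ceph.conf and restart); 1 is a
  # good start, 20 is very verbose
  ceph daemon osd.15 config set debug_ms 1

  # while an op is stuck: map the fd from the osd log to a socket ...
  ls -l /proc/<ceph-osd pid>/fd | grep socket

  # ... and check the TCP queues of that osd's connections; a non-zero
  # Recv-Q means data is sitting on the socket waiting to be read
  ss -tnp | grep <ceph-osd pid>
  netstat -tanp | grep ceph-osd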
I'm not sure how the kernel version might have affected the hosts' interaction with the network, but it seems like it's possible...

sage

Sage, thanks for those suggestions. I'll try next week and get back. You are right about jumbo frames and bonding (which I forgot to mention).

Just to make sure I understand correctly:
- Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
- And the effect is that there will be debug output in the log files? And even more, when set to 20?

Have a nice weekend,

	Uwe

I can provide more specific info if needed.

Thanks,

	Uwe

#### Hardware details ####

Host type 1:
  CPU: 2x Intel Xeon E5-2670
  RAM: 64GiB
  Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB) connected
  NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)

Host type 2:
  CPU: 2x Intel Xeon E5606
  RAM: 96GiB
  Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB) connected
  NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)

#### End Hardware ####

#### Ceph OSD Tree ####
 ID CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
 -1       12.72653 root default
 -2        1.36418     host px-alpha-cluster
  0   hdd  0.22729         osd.0                     up  1.00000 1.00000
  1   hdd  0.22729         osd.1                     up  1.00000 1.00000
  2   hdd  0.90959         osd.2                     up  1.00000 1.00000
 -3        1.36418     host px-bravo-cluster
  3   hdd  0.22729         osd.3                     up  1.00000 1.00000
  4   hdd  0.22729         osd.4                     up  1.00000 1.00000
  5   hdd  0.90959         osd.5                     up  1.00000 1.00000
 -4        2.04648     host px-charlie-cluster
  6   hdd  0.90959         osd.6                     up  1.00000 1.00000
  7   hdd  0.22729         osd.7                     up  1.00000 1.00000
  8   hdd  0.90959         osd.8                     up  1.00000 1.00000
 -5        2.04648     host px-delta-cluster
  9   hdd  0.22729         osd.9                     up  1.00000 1.00000
 10   hdd  0.90959         osd.10                    up  1.00000 1.00000
 11   hdd  0.90959         osd.11                    up  1.00000 1.00000
-11        2.72516     host px-echo-cluster
 12   hdd  0.45419         osd.12                    up  1.00000 1.00000
 13   hdd  0.45419         osd.13                    up  1.00000 1.00000
 14   hdd  0.45419         osd.14                    up  1.00000 1.00000
 15   hdd  0.45419         osd.15                    up  1.00000 1.00000
 16   hdd  0.45419         osd.16                    up  1.00000 1.00000
 17   hdd  0.45419         osd.17                    up  1.00000 1.00000
-13        3.18005     host px-foxtrott-cluster
 18   hdd  0.45419         osd.18                    up  1.00000 1.00000
 19   hdd  0.45419         osd.19                    up  1.00000 1.00000
 20   hdd  0.45419         osd.20                    up  1.00000 1.00000
 21   hdd  0.90909         osd.21                    up  1.00000 1.00000
 22   hdd  0.45419         osd.22                    up  1.00000 1.00000
 23   hdd  0.45419         osd.23                    up  1.00000 1.00000
#### End OSD Tree ####

#### CRUSH map ####
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host px-alpha-cluster {
    id -2           # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 1.364
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 0.227
    item osd.1 weight 0.227
    item osd.2 weight 0.910
}
host px-bravo-cluster {
    id -3           # do not change unnecessarily
    id -7 class hdd # do not change unnecessarily
    # weight 1.364
    alg straw
    hash 0  # rjenkins1
    item osd.3 weight 0.227
    item osd.4 weight 0.227
    item osd.5 weight 0.910
}
host px-charlie-cluster {
    id -4           # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 2.046
    alg straw
    hash 0  # rjenkins1
    item osd.7 weight 0.227
    item osd.8 weight 0.910
    item osd.6 weight 0.910
}
host px-delta-cluster {
    id -5           # do not change unnecessarily
    id -9 class hdd # do not change unnecessarily
    # weight 2.046
    alg straw
    hash 0  # rjenkins1
    item osd.9 weight 0.227
    item osd.10 weight 0.910
    item osd.11 weight 0.910
}
host px-echo-cluster {
    id -11           # do not change unnecessarily
    id -12 class hdd # do not change unnecessarily
    # weight 2.725
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 0.454
    item osd.13 weight 0.454
    item osd.14 weight 0.454
    item osd.16 weight 0.454
    item osd.17 weight 0.454
    item osd.15 weight 0.454
}
host px-foxtrott-cluster {
    id -13           # do not change unnecessarily
    id -14 class hdd # do not change unnecessarily
    # weight 3.180
    alg straw2
    hash 0  # rjenkins1
    item osd.18 weight 0.454
    item osd.19 weight 0.454
    item osd.20 weight 0.454
    item osd.22 weight 0.454
    item osd.23 weight 0.454
    item osd.21 weight 0.909
}
root default {
    id -1            # do not change unnecessarily
    id -10 class hdd # do not change unnecessarily
    # weight 12.727
    alg straw
    hash 0  # rjenkins1
    item px-alpha-cluster weight 1.364
    item px-bravo-cluster weight 1.364
    item px-charlie-cluster weight 2.046
    item px-delta-cluster weight 2.046
    item px-echo-cluster weight 2.725
    item px-foxtrott-cluster weight 3.180
}

# rules
rule replicated_ruleset {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
#### End CRUSH ####
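P.S.: To rule out a path MTU problem on the 10GbE links (MTU 9000 above), one quick check is a do-not-fragment ping at full jumbo size between two of the hosts; the interface name below is only an example and would need to match the Myricom NIC:

  # confirm the configured MTU on the 10GbE interface
  ip link show dev enp3s0 | grep mtu

  # 8972 bytes of ICMP payload + 28 bytes of IP/ICMP headers = a 9000-byte packet;
  # with -M do the packet may not be fragmented, so this fails or times out if
  # any hop on the path drops jumbo frames
  ping -M do -s 8972 -c 3 <other ceph host>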