Hi Sage,

It turned out to be a Jumbo frame issue after all. After we fixed the MTU there are no more slow requests
and no more osd op thread timeouts/suicides. That is the weird part to me: it is hard to see any connection
between the MTU and the osd op thread or filestore thread.

Thanks,
Dongdong

> On 25 Aug 2018, at 22:26, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> Hi Dongdong,
>
> I think you're right--if the op thread is stuck then it's not a networking
> issue. My next guess would be that there is an object that the backend is
> stuck processing, like an rgw index object with too many object entries.
> If you turn up debug filestore = 20 on the running daemon while it is
> stuck (ceph daemon osd.NNN config set debug_filestore 20) you might see
> output in the log indicating what it is working on.
>
> sage
>
>
> On Wed, 22 Aug 2018, 陶冬冬 wrote:
>
>> Hey Sage,
>>
>> I just saw your comment that Jumbo frames might cause this kind of slow request.
>> We hit this kind of issue a few weeks ago, but there were also osd op thread
>> timeouts, some even reaching the suicide timeout. I traced the osd log at
>> severity 20, and from what I can see the timed-out thread simply did not get
>> scheduled during that period, which left the corresponding slow request stuck
>> in the "queued_for_pg" event. It is very strange to me why a misconfigured MTU
>> on the network side would cause the osd op thread to time out.
>>
>> Thanks & Regards,
>> Dongdong
>>
>>> On 17 Aug 2018, at 20:29, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
>>>
>>> On 17.08.18 at 14:23, Sage Weil wrote:
>>>> On Fri, 17 Aug 2018, Uwe Sauter wrote:
>>>>>
>>>>> Dear devs,
>>>>>
>>>>> I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…
>>>>>
>>>>> TL;DR: the cluster runs fine with kernel 4.13 but produces slow_requests with kernel 4.15. How do I debug this?
>>>>>
>>>>> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
>>>>> The main difference between those hosts is the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks.
>>>>>
>>>>> The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu kernels and the Proxmox
>>>>> virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.
>>>>>
>>>>> On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the other 3 only run OSDs.
>>>>> The OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7. All OSDs are BlueStore.
>>>>>
>>>>> Now here's the thing:
>>>>>
>>>>> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that
>>>>> cause blocked IO inside the VMs running on the cluster (but not necessarily on the host
>>>>> with the OSD causing the slow request).
>>>>>
>>>>> If I boot back into 4.13, Ceph runs smoothly again.
>>>>>
>>>>> I'm seeking help to debug this issue as I'm running out of ideas what else I could do.
>>>>> So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose it, which always indicates that the
>>>>> primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23")
>>>>> but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
>>>>> no commit message from OSD 9).
>>>>>
>>>>> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD
>>>>> does not commit, there should be an operation that does not succeed, should there not?
>>>>>
>>>>> Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that
>>>>> does not commit has no effect on the issue.
>>>>>
>>>>> Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not
>>>>> a networking or kernel issue?
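A sketch of how the "dump_blocked_ops" check described above could be repeated for every OSD on
one host. It assumes the default admin-socket location under /var/run/ceph/; adjust the glob if
the sockets live elsewhere.

    # Print the blocked-op count for every OSD with an admin socket on this host.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*/ceph-osd.}; id=${id%.asok}      # extract the OSD id from the socket name
        printf 'osd.%s: ' "$id"
        ceph daemon "osd.$id" dump_blocked_ops | grep -m1 '"num_blocked_ops"'
    done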
>>>>
>>>> This kind of issue has usually turned out to be a networking issue in the
>>>> past (either kernel or hardware, or some combination of the two). I would
>>>> suggest adding debug_ms=1, reproducing, and seeing whether the replicated op
>>>> makes it to the blocked replica. It sounds like it isn't... in which case
>>>> cranking it up to debug_ms=20 and reproducing will show you more about
>>>> when Ceph is reading data off the socket and when it isn't. And while it
>>>> is stuck you can identify the fd involved, check the socket status with
>>>> netstat, see whether the 'data waiting' flag is set, and so on.
>>>>
>>>> But the times we've gotten to that level it has (I think) always ended up
>>>> being either jumbo frame issues with the network hardware or problems
>>>> with, say, bonding. I'm not sure how the kernel version might affect the
>>>> host's interaction with the network, but it seems possible...
>>>>
>>>> sage
>>>>
>>>
>>> Sage,
>>>
>>> thanks for those suggestions. I'll try next week and get back to you. You are right about jumbo frames and bonding
>>> (which I forgot to mention).
>>>
>>> Just to make sure I understand correctly:
>>>
>>> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
>>> - And the effect is that there will be debug output in the log files? And even more when set to 20?
>>>
>>> Have a nice weekend,
>>>
>>> Uwe
>>>
>>>>
>>>>>
>>>>> I can provide more specific info if needed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Uwe
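On Uwe's question above: debug_ms can be set in the [osd] section of ceph.conf and takes effect
after a restart, but it can also be raised on a running OSD without restarting it, and the higher
the level the more messenger output ends up in the OSD log. A minimal sketch of that workflow,
together with the socket check Sage mentions (osd.15 is just the example id from this thread):

    # Raise messenger debugging on a running OSD; remember to lower it again
    # afterwards, the logs grow very quickly at level 20.
    ceph daemon osd.15 config set debug_ms 1        # via the local admin socket
    ceph tell osd.15 injectargs '--debug_ms 1'      # same effect, via the monitors

    # While a subop is stuck, look at the OSD's TCP sockets: a Recv-Q that stays
    # non-zero means data is queued that the daemon never reads, while a missing
    # connection to the peer OSD points at the network or bonding instead.
    pid=$(pgrep -f 'ceph-osd.*--id 15')             # the match pattern is an example
    netstat -tnp 2>/dev/null | grep "$pid/ceph-osd"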
>>>>>
>>>>> #### Hardware details ####
>>>>> Host type 1:
>>>>> CPU: 2x Intel Xeon E5-2670
>>>>> RAM: 64GiB
>>>>> Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
>>>>> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>
>>>>> Host type 2:
>>>>> CPU: 2x Intel Xeon E5606
>>>>> RAM: 96GiB
>>>>> Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
>>>>> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>> #### End Hardware ####
>>>>>
>>>>> #### Ceph OSD Tree ####
>>>>> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
>>>>>  -1       12.72653 root default
>>>>>  -2        1.36418     host px-alpha-cluster
>>>>>   0   hdd  0.22729         osd.0                     up  1.00000 1.00000
>>>>>   1   hdd  0.22729         osd.1                     up  1.00000 1.00000
>>>>>   2   hdd  0.90959         osd.2                     up  1.00000 1.00000
>>>>>  -3        1.36418     host px-bravo-cluster
>>>>>   3   hdd  0.22729         osd.3                     up  1.00000 1.00000
>>>>>   4   hdd  0.22729         osd.4                     up  1.00000 1.00000
>>>>>   5   hdd  0.90959         osd.5                     up  1.00000 1.00000
>>>>>  -4        2.04648     host px-charlie-cluster
>>>>>   6   hdd  0.90959         osd.6                     up  1.00000 1.00000
>>>>>   7   hdd  0.22729         osd.7                     up  1.00000 1.00000
>>>>>   8   hdd  0.90959         osd.8                     up  1.00000 1.00000
>>>>>  -5        2.04648     host px-delta-cluster
>>>>>   9   hdd  0.22729         osd.9                     up  1.00000 1.00000
>>>>>  10   hdd  0.90959         osd.10                    up  1.00000 1.00000
>>>>>  11   hdd  0.90959         osd.11                    up  1.00000 1.00000
>>>>> -11        2.72516     host px-echo-cluster
>>>>>  12   hdd  0.45419         osd.12                    up  1.00000 1.00000
>>>>>  13   hdd  0.45419         osd.13                    up  1.00000 1.00000
>>>>>  14   hdd  0.45419         osd.14                    up  1.00000 1.00000
>>>>>  15   hdd  0.45419         osd.15                    up  1.00000 1.00000
>>>>>  16   hdd  0.45419         osd.16                    up  1.00000 1.00000
>>>>>  17   hdd  0.45419         osd.17                    up  1.00000 1.00000
>>>>> -13        3.18005     host px-foxtrott-cluster
>>>>>  18   hdd  0.45419         osd.18                    up  1.00000 1.00000
>>>>>  19   hdd  0.45419         osd.19                    up  1.00000 1.00000
>>>>>  20   hdd  0.45419         osd.20                    up  1.00000 1.00000
>>>>>  21   hdd  0.90909         osd.21                    up  1.00000 1.00000
>>>>>  22   hdd  0.45419         osd.22                    up  1.00000 1.00000
>>>>>  23   hdd  0.45419         osd.23                    up  1.00000 1.00000
>>>>> #### End OSD Tree ####
>>>>>
>>>>> #### CRUSH map ####
>>>>> # begin crush map
>>>>> tunable choose_local_tries 0
>>>>> tunable choose_local_fallback_tries 0
>>>>> tunable choose_total_tries 50
>>>>> tunable chooseleaf_descend_once 1
>>>>> tunable chooseleaf_vary_r 1
>>>>> tunable chooseleaf_stable 1
>>>>> tunable straw_calc_version 1
>>>>> tunable allowed_bucket_algs 54
>>>>>
>>>>> # devices
>>>>> device 0 osd.0 class hdd
>>>>> device 1 osd.1 class hdd
>>>>> device 2 osd.2 class hdd
>>>>> device 3 osd.3 class hdd
>>>>> device 4 osd.4 class hdd
>>>>> device 5 osd.5 class hdd
>>>>> device 6 osd.6 class hdd
>>>>> device 7 osd.7 class hdd
>>>>> device 8 osd.8 class hdd
>>>>> device 9 osd.9 class hdd
>>>>> device 10 osd.10 class hdd
>>>>> device 11 osd.11 class hdd
>>>>> device 12 osd.12 class hdd
>>>>> device 13 osd.13 class hdd
>>>>> device 14 osd.14 class hdd
>>>>> device 15 osd.15 class hdd
>>>>> device 16 osd.16 class hdd
>>>>> device 17 osd.17 class hdd
>>>>> device 18 osd.18 class hdd
>>>>> device 19 osd.19 class hdd
>>>>> device 20 osd.20 class hdd
>>>>> device 21 osd.21 class hdd
>>>>> device 22 osd.22 class hdd
>>>>> device 23 osd.23 class hdd
>>>>>
>>>>> # types
>>>>> type 0 osd
>>>>> type 1 host
>>>>> type 2 chassis
>>>>> type 3 rack
>>>>> type 4 row
>>>>> type 5 pdu
>>>>> type 6 pod
>>>>> type 7 room
>>>>> type 8 datacenter
>>>>> type 9 region
>>>>> type 10 root
>>>>>
>>>>> # buckets
>>>>> host px-alpha-cluster {
>>>>>         id -2                   # do not change unnecessarily
>>>>>         id -6 class hdd         # do not change unnecessarily
>>>>>         # weight 1.364
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.0 weight 0.227
>>>>>         item osd.1 weight 0.227
>>>>>         item osd.2 weight 0.910
>>>>> }
>>>>> host px-bravo-cluster {
>>>>>         id -3                   # do not change unnecessarily
>>>>>         id -7 class hdd         # do not change unnecessarily
>>>>>         # weight 1.364
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.3 weight 0.227
>>>>>         item osd.4 weight 0.227
>>>>>         item osd.5 weight 0.910
>>>>> }
>>>>> host px-charlie-cluster {
>>>>>         id -4                   # do not change unnecessarily
>>>>>         id -8 class hdd         # do not change unnecessarily
>>>>>         # weight 2.046
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.7 weight 0.227
>>>>>         item osd.8 weight 0.910
>>>>>         item osd.6 weight 0.910
>>>>> }
>>>>> host px-delta-cluster {
>>>>>         id -5                   # do not change unnecessarily
>>>>>         id -9 class hdd         # do not change unnecessarily
>>>>>         # weight 2.046
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.9 weight 0.227
>>>>>         item osd.10 weight 0.910
>>>>>         item osd.11 weight 0.910
>>>>> }
>>>>> host px-echo-cluster {
>>>>>         id -11                  # do not change unnecessarily
>>>>>         id -12 class hdd        # do not change unnecessarily
>>>>>         # weight 2.725
>>>>>         alg straw2
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.12 weight 0.454
>>>>>         item osd.13 weight 0.454
>>>>>         item osd.14 weight 0.454
>>>>>         item osd.16 weight 0.454
>>>>>         item osd.17 weight 0.454
>>>>>         item osd.15 weight 0.454
>>>>> }
>>>>> host px-foxtrott-cluster {
>>>>>         id -13                  # do not change unnecessarily
>>>>>         id -14 class hdd        # do not change unnecessarily
>>>>>         # weight 3.180
>>>>>         alg straw2
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.18 weight 0.454
>>>>>         item osd.19 weight 0.454
>>>>>         item osd.20 weight 0.454
>>>>>         item osd.22 weight 0.454
>>>>>         item osd.23 weight 0.454
>>>>>         item osd.21 weight 0.909
>>>>> }
>>>>> root default {
>>>>>         id -1                   # do not change unnecessarily
>>>>>         id -10 class hdd        # do not change unnecessarily
>>>>>         # weight 12.727
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item px-alpha-cluster weight 1.364
>>>>>         item px-bravo-cluster weight 1.364
>>>>>         item px-charlie-cluster weight 2.046
>>>>>         item px-delta-cluster weight 2.046
>>>>>         item px-echo-cluster weight 2.725
>>>>>         item px-foxtrott-cluster weight 3.180
>>>>> }
>>>>>
>>>>> # rules
>>>>> rule replicated_ruleset {
>>>>>         id 0
>>>>>         type replicated
>>>>>         min_size 1
>>>>>         max_size 10
>>>>>         step take default
>>>>>         step chooseleaf firstn 0 type host
>>>>>         step emit
>>>>> }
>>>>>
>>>>> # end crush map
>>>>> #### End CRUSH ####
>>
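Relating to the jumbo-frame resolution reported at the top of the thread, a minimal sketch for
verifying that a 9000-byte MTU actually survives the path between two OSD hosts; the interface
name is a placeholder and the target hostname is taken from the OSD tree above.

    # Confirm that the Ceph-facing interface really carries MTU 9000 (interface name is a placeholder).
    ip link show dev eth1 | grep -o 'mtu [0-9]*'

    # Send a full-size jumbo frame with the don't-fragment bit set:
    # 8972 = 9000 byte MTU - 20 byte IP header - 8 byte ICMP header.
    # If this fails while a plain ping works, some hop (switch port, bond, VLAN)
    # is dropping or fragmenting jumbo frames.
    ping -M do -s 8972 -c 3 px-bravo-cluster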