Sage, Dongdong,

I'll be out of office for the next two weeks and thus won't be following
the issue. As already mentioned, it works fine with kernel 4.13, which is
currently running again.

What I did so far is configure all hosts to use MTU 1500, and the issue
still occurs. I don't have access to the switch where four of the hosts
connect (a Cisco Nexus) and cannot disable LACP there; the Nexus won't
allow non-LACP connections, so I was only able to unconfigure the bonding
interface on the other two hosts, which connect to an HP switch.

Someone else suggested the Spectre mitigations as a cause. Do you have an
idea how to prove/disprove this with Ceph itself (not only by disabling
the mitigations and checking for timeouts in the logs, etc.)?
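The only Ceph-side signal I can think of so far would be comparing per-OSD
latencies between the two kernels under similar load, e.g.

   ceph osd perf                        # commit/apply latency per OSD
   ceph daemon osd.15 dump_historic_ops # per-event timings of recent slow ops

(osd.15 just as an example), combined with checking which mitigations the
running kernel actually applies:

   grep . /sys/devices/system/cpu/vulnerabilities/*

though that sysfs directory only exists on kernels that already carry the
mitigation patches. Or is there a better metric to watch?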
Regards,

	Uwe

On 25 August 2018 16:37:13 CEST, "陶冬冬" <tdd21151186@xxxxxxxxx> wrote:
>Hi Sage,
>
>It turned out to be a jumbo frame issue eventually. After we fixed the
>MTU issue there are no slow requests anymore, and no osd op thread
>timeouts/suicides anymore. That's the weird part to me: it's hard to
>see any connection between the MTU and the osd op thread or filestore
>thread.
>
>Thanks,
>Dongdong
>
>> On 25 Aug 2018, at 10:26 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>
>> Hi Dongdong,
>>
>> I think you're right--if the op thread is stuck then it's not a
>> networking issue. My next guess would be that there is an object
>> that the backend is stuck processing, like an rgw index object with
>> too many object entries. If you turn up debug filestore = 20 on the
>> running daemon while it is stuck (ceph daemon osd.NNN config set
>> debug_filestore 20) you might see output in the log indicating what
>> it is working on.
>>
>> sage
>>
>> On Wed, 22 Aug 2018, 陶冬冬 wrote:
>>
>>> Hey Sage,
>>>
>>> I just saw your comment that jumbo frames might cause this kind of
>>> slow request. We hit this kind of issue a few weeks ago, but there
>>> were also osd op thread timeouts, even reaching the suicide
>>> timeout. I traced the osd log at severity 20; from what I can see,
>>> the timed-out thread simply did not get executed during that time,
>>> which left the corresponding slow request stuck in the event
>>> "queued_for_pg". It is very strange to me why a misconfigured MTU
>>> on the network side would cause the osd op thread to time out.
>>>
>>> Thanks & Regards,
>>> Dongdong
>>>
>>>> On 17 Aug 2018, at 8:29 PM, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
>>>>
>>>> On 17 Aug 2018 at 14:23, Sage Weil wrote:
>>>>> On Fri, 17 Aug 2018, Uwe Sauter wrote:
>>>>>>
>>>>>> Dear devs,
>>>>>>
>>>>>> I'm posting on ceph-devel because I didn't get any feedback on
>>>>>> ceph-users. This is an act of desperation…
>>>>>>
>>>>>> TL;DR: The cluster runs well with kernel 4.13 but produces slow
>>>>>> requests with kernel 4.15. How to debug?
>>>>>>
>>>>>> I'm running a combined Ceph/KVM cluster consisting of 6 hosts of
>>>>>> 2 different kinds (details at the end). The main differences
>>>>>> between those hosts are CPU generation (Westmere / Sandy Bridge)
>>>>>> and the number of OSD disks.
>>>>>>
>>>>>> The cluster runs Proxmox 5.2, which is essentially Debian 9 but
>>>>>> with Ubuntu kernels and the Proxmox virtualization framework.
>>>>>> The Proxmox WebUI also integrates some kind of Ceph management.
>>>>>>
>>>>>> On the Ceph side, the cluster has 3 nodes that run MGR, MON and
>>>>>> OSDs while the other 3 only run OSDs. OSD tree and CRUSH map are
>>>>>> at the end. Ceph version is 12.2.7. All OSDs are BlueStore.
>>>>>>
>>>>>> Now here's the thing:
>>>>>>
>>>>>> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since
>>>>>> then I'm getting slow requests that cause blocked IO inside the
>>>>>> VMs that are running on the cluster (but not necessarily on the
>>>>>> host with the OSD causing the slow request).
>>>>>>
>>>>>> If I boot back into 4.13 then Ceph runs smoothly again.
>>>>>>
>>>>>> I'm asking for help debugging this issue as I'm running out of
>>>>>> ideas what else I could do. So far I was using "ceph daemon
>>>>>> osd.X dump_blocked_ops" to diagnose; it always indicates that
>>>>>> the primary OSD scheduled copies on two secondaries (e.g. OSD
>>>>>> 15: "event": "waiting for subops from 9,23") but only one of
>>>>>> those succeeds ("event": "sub_op_commit_rec from 23"). The other
>>>>>> one blocks (there is no commit message from OSD 9).
>>>>>>
>>>>>> On OSD 9 there is no blocked operation ("num_blocked_ops": 0),
>>>>>> which confuses me a lot. If this OSD does not commit, there
>>>>>> should be an operation that does not succeed, should there not?
>>>>>>
>>>>>> Restarting the (primary) OSD with the blocked operation clears
>>>>>> the error; restarting the secondary OSD that does not commit has
>>>>>> no effect on the issue.
>>>>>>
>>>>>> Any ideas on how to debug this further? What should I do to
>>>>>> identify this as a Ceph issue and not a networking or kernel
>>>>>> issue?
>>>>>
>>>>> This kind of issue has usually turned out to be a networking
>>>>> issue in the past (either kernel or hardware, or some combination
>>>>> of the two). I would suggest adding debug_ms=1, reproducing, and
>>>>> seeing whether the replicated op makes it to the blocked replica.
>>>>> It sounds like it isn't... in which case cranking it up to
>>>>> debug_ms=20 and reproducing will show you more about when Ceph is
>>>>> reading data off the socket and when it isn't. And while it is
>>>>> stuck you can identify the fd involved, check the socket status
>>>>> with netstat, see if the 'data waiting' flag is set or not, and
>>>>> so on.
>>>>>
>>>>> But the times we've gotten to that level it has (I think) always
>>>>> ended up being either jumbo frame issues with the network
>>>>> hardware or problems with, say, bonding. I'm not sure how the
>>>>> kernel version might have affected the hosts' interaction with
>>>>> the network, but it seems like it's possible...
>>>>>
>>>>> sage
>>>>
>>>> Sage,
>>>>
>>>> thanks for those suggestions. I'll try next week and get back. You
>>>> are right about jumbo frames and bonding (which I forgot to
>>>> mention).
>>>>
>>>> Just to make sure I understand correctly:
>>>>
>>>> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
>>>> - And the effect is that there will be debug output in the log
>>>>   files? And even more when set to 20?
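>>>> (My working assumption, to be verified: both ways should work, e.g.
>>>>
>>>>    # in ceph.conf, picked up after a daemon restart:
>>>>    [osd]
>>>>        debug ms = 1
>>>>
>>>>    # or injected into a running daemon via its admin socket:
>>>>    ceph daemon osd.9 config set debug_ms 20
>>>>
>>>> with the messenger output ending up in the usual log, e.g.
>>>> /var/log/ceph/ceph-osd.9.log, and 20 being far noisier than 1.)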
>>>> Have a nice weekend,
>>>>
>>>> Uwe
>>>>
>>>>>>
>>>>>> I can provide more specific info if needed.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>> #### Hardware details ####
>>>>>> Host type 1:
>>>>>>   CPU: 2x Intel Xeon E5-2670
>>>>>>   RAM: 64GiB
>>>>>>   Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
>>>>>>   Connected NICs: 1x 1GbE Intel (management access, MTU 1500),
>>>>>>                   1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>>
>>>>>> Host type 2:
>>>>>>   CPU: 2x Intel Xeon E5606
>>>>>>   RAM: 96GiB
>>>>>>   Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
>>>>>>   Connected NICs: 1x 1GbE Intel (management access, MTU 1500),
>>>>>>                   1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>> #### End Hardware ####
>>>>>>
>>>>>> #### Ceph OSD Tree ####
>>>>>> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
>>>>>>  -1       12.72653 root default
>>>>>>  -2        1.36418     host px-alpha-cluster
>>>>>>   0   hdd  0.22729         osd.0                     up  1.00000 1.00000
>>>>>>   1   hdd  0.22729         osd.1                     up  1.00000 1.00000
>>>>>>   2   hdd  0.90959         osd.2                     up  1.00000 1.00000
>>>>>>  -3        1.36418     host px-bravo-cluster
>>>>>>   3   hdd  0.22729         osd.3                     up  1.00000 1.00000
>>>>>>   4   hdd  0.22729         osd.4                     up  1.00000 1.00000
>>>>>>   5   hdd  0.90959         osd.5                     up  1.00000 1.00000
>>>>>>  -4        2.04648     host px-charlie-cluster
>>>>>>   6   hdd  0.90959         osd.6                     up  1.00000 1.00000
>>>>>>   7   hdd  0.22729         osd.7                     up  1.00000 1.00000
>>>>>>   8   hdd  0.90959         osd.8                     up  1.00000 1.00000
>>>>>>  -5        2.04648     host px-delta-cluster
>>>>>>   9   hdd  0.22729         osd.9                     up  1.00000 1.00000
>>>>>>  10   hdd  0.90959         osd.10                    up  1.00000 1.00000
>>>>>>  11   hdd  0.90959         osd.11                    up  1.00000 1.00000
>>>>>> -11        2.72516     host px-echo-cluster
>>>>>>  12   hdd  0.45419         osd.12                    up  1.00000 1.00000
>>>>>>  13   hdd  0.45419         osd.13                    up  1.00000 1.00000
>>>>>>  14   hdd  0.45419         osd.14                    up  1.00000 1.00000
>>>>>>  15   hdd  0.45419         osd.15                    up  1.00000 1.00000
>>>>>>  16   hdd  0.45419         osd.16                    up  1.00000 1.00000
>>>>>>  17   hdd  0.45419         osd.17                    up  1.00000 1.00000
>>>>>> -13        3.18005     host px-foxtrott-cluster
>>>>>>  18   hdd  0.45419         osd.18                    up  1.00000 1.00000
>>>>>>  19   hdd  0.45419         osd.19                    up  1.00000 1.00000
>>>>>>  20   hdd  0.45419         osd.20                    up  1.00000 1.00000
>>>>>>  21   hdd  0.90909         osd.21                    up  1.00000 1.00000
>>>>>>  22   hdd  0.45419         osd.22                    up  1.00000 1.00000
>>>>>>  23   hdd  0.45419         osd.23                    up  1.00000 1.00000
>>>>>> #### End OSD Tree ####
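>>>>>> (The map below was dumped with something like
>>>>>>
>>>>>>    ceph osd getcrushmap -o crush.bin    # fetch the binary CRUSH map
>>>>>>    crushtool -d crush.bin -o crush.txt  # decompile to the text form shown
>>>>>>
>>>>>> where crush.bin/crush.txt are arbitrary file names.)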
>>>>>>
>>>>>> #### CRUSH map ####
>>>>>> # begin crush map
>>>>>> tunable choose_local_tries 0
>>>>>> tunable choose_local_fallback_tries 0
>>>>>> tunable choose_total_tries 50
>>>>>> tunable chooseleaf_descend_once 1
>>>>>> tunable chooseleaf_vary_r 1
>>>>>> tunable chooseleaf_stable 1
>>>>>> tunable straw_calc_version 1
>>>>>> tunable allowed_bucket_algs 54
>>>>>>
>>>>>> # devices
>>>>>> device 0 osd.0 class hdd
>>>>>> device 1 osd.1 class hdd
>>>>>> device 2 osd.2 class hdd
>>>>>> device 3 osd.3 class hdd
>>>>>> device 4 osd.4 class hdd
>>>>>> device 5 osd.5 class hdd
>>>>>> device 6 osd.6 class hdd
>>>>>> device 7 osd.7 class hdd
>>>>>> device 8 osd.8 class hdd
>>>>>> device 9 osd.9 class hdd
>>>>>> device 10 osd.10 class hdd
>>>>>> device 11 osd.11 class hdd
>>>>>> device 12 osd.12 class hdd
>>>>>> device 13 osd.13 class hdd
>>>>>> device 14 osd.14 class hdd
>>>>>> device 15 osd.15 class hdd
>>>>>> device 16 osd.16 class hdd
>>>>>> device 17 osd.17 class hdd
>>>>>> device 18 osd.18 class hdd
>>>>>> device 19 osd.19 class hdd
>>>>>> device 20 osd.20 class hdd
>>>>>> device 21 osd.21 class hdd
>>>>>> device 22 osd.22 class hdd
>>>>>> device 23 osd.23 class hdd
>>>>>>
>>>>>> # types
>>>>>> type 0 osd
>>>>>> type 1 host
>>>>>> type 2 chassis
>>>>>> type 3 rack
>>>>>> type 4 row
>>>>>> type 5 pdu
>>>>>> type 6 pod
>>>>>> type 7 room
>>>>>> type 8 datacenter
>>>>>> type 9 region
>>>>>> type 10 root
>>>>>>
>>>>>> # buckets
>>>>>> host px-alpha-cluster {
>>>>>>     id -2              # do not change unnecessarily
>>>>>>     id -6 class hdd    # do not change unnecessarily
>>>>>>     # weight 1.364
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.0 weight 0.227
>>>>>>     item osd.1 weight 0.227
>>>>>>     item osd.2 weight 0.910
>>>>>> }
>>>>>> host px-bravo-cluster {
>>>>>>     id -3              # do not change unnecessarily
>>>>>>     id -7 class hdd    # do not change unnecessarily
>>>>>>     # weight 1.364
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.3 weight 0.227
>>>>>>     item osd.4 weight 0.227
>>>>>>     item osd.5 weight 0.910
>>>>>> }
>>>>>> host px-charlie-cluster {
>>>>>>     id -4              # do not change unnecessarily
>>>>>>     id -8 class hdd    # do not change unnecessarily
>>>>>>     # weight 2.046
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.7 weight 0.227
>>>>>>     item osd.8 weight 0.910
>>>>>>     item osd.6 weight 0.910
>>>>>> }
>>>>>> host px-delta-cluster {
>>>>>>     id -5              # do not change unnecessarily
>>>>>>     id -9 class hdd    # do not change unnecessarily
>>>>>>     # weight 2.046
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.9 weight 0.227
>>>>>>     item osd.10 weight 0.910
>>>>>>     item osd.11 weight 0.910
>>>>>> }
>>>>>> host px-echo-cluster {
>>>>>>     id -11             # do not change unnecessarily
>>>>>>     id -12 class hdd   # do not change unnecessarily
>>>>>>     # weight 2.725
>>>>>>     alg straw2
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.12 weight 0.454
>>>>>>     item osd.13 weight 0.454
>>>>>>     item osd.14 weight 0.454
>>>>>>     item osd.16 weight 0.454
>>>>>>     item osd.17 weight 0.454
>>>>>>     item osd.15 weight 0.454
>>>>>> }
>>>>>> host px-foxtrott-cluster {
>>>>>>     id -13             # do not change unnecessarily
>>>>>>     id -14 class hdd   # do not change unnecessarily
>>>>>>     # weight 3.180
>>>>>>     alg straw2
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.18 weight 0.454
>>>>>>     item osd.19 weight 0.454
>>>>>>     item osd.20 weight 0.454
>>>>>>     item osd.22 weight 0.454
>>>>>>     item osd.23 weight 0.454
>>>>>>     item osd.21 weight 0.909
>>>>>> }
>>>>>> root default {
>>>>>>     id -1              # do not change unnecessarily
>>>>>>     id -10 class hdd   # do not change unnecessarily
>>>>>>     # weight 12.727
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item px-alpha-cluster weight 1.364
>>>>>>     item px-bravo-cluster weight 1.364
>>>>>>     item px-charlie-cluster weight 2.046
>>>>>>     item px-delta-cluster weight 2.046
>>>>>>     item px-echo-cluster weight 2.725
>>>>>>     item px-foxtrott-cluster weight 3.180
>>>>>> }
>>>>>>
>>>>>> # rules
>>>>>> rule replicated_ruleset {
>>>>>>     id 0
>>>>>>     type replicated
>>>>>>     min_size 1
>>>>>>     max_size 10
>>>>>>     step take default
>>>>>>     step chooseleaf firstn 0 type host
>>>>>>     step emit
>>>>>> }
>>>>>>
>>>>>> # end crush map
>>>>>> #### End CRUSH ####

--
This message was sent from my Android device with K-9 Mail.