Sage, Dongdong,

I'll be out of office for the next two weeks and thus won't be following
the issue. As already mentioned, it works fine with kernel 4.13, which is
currently running again.

What I did so far is configure all hosts to use MTU 1500, and the issue
still occurs. I don't have access to the switch where four of the hosts
connect (a Cisco Nexus) and cannot disable LACP there; the Nexus won't
allow non-LACP connections, so I was only able to unconfigure the bonding
interface on the other two hosts, which connect to an HP switch.

Someone else suggested the Spectre mitigations as a cause. Do you have an
idea how to prove/disprove this with Ceph itself (not only by disabling
the mitigations and checking for timeouts in the logs, etc.)?
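The only Ceph-side signal I can think of so far would be comparing per-OSD
latencies between the two kernels under similar load, e.g.

   ceph osd perf                        # commit/apply latency per OSD
   ceph daemon osd.15 dump_historic_ops # per-event timings of recent slow ops

(osd.15 just as an example), combined with checking which mitigations the
running kernel actually applies:

   grep . /sys/devices/system/cpu/vulnerabilities/*

though that sysfs directory only exists on kernels that already carry the
mitigation patches. Or is there a better metric to watch?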
Regards,

	Uwe

On 25 August 2018 16:37:13 CEST, "陶冬冬" <tdd21151186@xxxxxxxxx> wrote:
>Hi Sage,
>
>It turned out to be a jumbo frame issue eventually. After we fixed the
>MTU issue there are no slow requests anymore, and no osd op thread
>timeouts/suicides anymore. That's the weird part to me: it's hard to
>see any connection between the MTU and the osd op thread or filestore
>thread.
>
>Thanks,
>Dongdong
>
>> On 25 Aug 2018, at 10:26 PM, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>>
>> Hi Dongdong,
>>
>> I think you're right--if the op thread is stuck then it's not a
>> networking issue. My next guess would be that there is an object
>> that the backend is stuck processing, like an rgw index object with
>> too many object entries. If you turn up debug filestore = 20 on the
>> running daemon while it is stuck (ceph daemon osd.NNN config set
>> debug_filestore 20) you might see output in the log indicating what
>> it is working on.
>>
>> sage
>>
>> On Wed, 22 Aug 2018, 陶冬冬 wrote:
>>
>>> Hey Sage,
>>>
>>> I just saw your comment that jumbo frames might cause this kind of
>>> slow request. We hit this kind of issue a few weeks ago, but there
>>> were also osd op thread timeouts, even reaching the suicide
>>> timeout. I traced the osd log at severity 20; from what I can see,
>>> the timed-out thread simply did not get executed during that time,
>>> which left the corresponding slow request stuck in the event
>>> "queued_for_pg". It is very strange to me why a misconfigured MTU
>>> on the network side would cause the osd op thread to time out.
>>>
>>> Thanks & Regards,
>>> Dongdong
>>>
>>>> On 17 Aug 2018, at 8:29 PM, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
>>>>
>>>> On 17 Aug 2018 at 14:23, Sage Weil wrote:
>>>>> On Fri, 17 Aug 2018, Uwe Sauter wrote:
>>>>>>
>>>>>> Dear devs,
>>>>>>
>>>>>> I'm posting on ceph-devel because I didn't get any feedback on
>>>>>> ceph-users. This is an act of desperation…
>>>>>>
>>>>>> TL;DR: The cluster runs well with kernel 4.13 but produces slow
>>>>>> requests with kernel 4.15. How to debug?
>>>>>>
>>>>>> I'm running a combined Ceph/KVM cluster consisting of 6 hosts of
>>>>>> 2 different kinds (details at the end). The main differences
>>>>>> between those hosts are CPU generation (Westmere / Sandy Bridge)
>>>>>> and the number of OSD disks.
>>>>>>
>>>>>> The cluster runs Proxmox 5.2, which is essentially Debian 9 but
>>>>>> with Ubuntu kernels and the Proxmox virtualization framework.
>>>>>> The Proxmox WebUI also integrates some kind of Ceph management.
>>>>>>
>>>>>> On the Ceph side, the cluster has 3 nodes that run MGR, MON and
>>>>>> OSDs while the other 3 only run OSDs. OSD tree and CRUSH map are
>>>>>> at the end. Ceph version is 12.2.7. All OSDs are BlueStore.
>>>>>>
>>>>>> Now here's the thing:
>>>>>>
>>>>>> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since
>>>>>> then I'm getting slow requests that cause blocked IO inside the
>>>>>> VMs that are running on the cluster (but not necessarily on the
>>>>>> host with the OSD causing the slow request).
>>>>>>
>>>>>> If I boot back into 4.13 then Ceph runs smoothly again.
>>>>>>
>>>>>> I'm asking for help debugging this issue as I'm running out of
>>>>>> ideas what else I could do. So far I was using "ceph daemon
>>>>>> osd.X dump_blocked_ops" to diagnose; it always indicates that
>>>>>> the primary OSD scheduled copies on two secondaries (e.g. OSD
>>>>>> 15: "event": "waiting for subops from 9,23") but only one of
>>>>>> those succeeds ("event": "sub_op_commit_rec from 23"). The other
>>>>>> one blocks (there is no commit message from OSD 9).
>>>>>>
>>>>>> On OSD 9 there is no blocked operation ("num_blocked_ops": 0),
>>>>>> which confuses me a lot. If this OSD does not commit, there
>>>>>> should be an operation that does not succeed, should there not?
>>>>>>
>>>>>> Restarting the (primary) OSD with the blocked operation clears
>>>>>> the error; restarting the secondary OSD that does not commit has
>>>>>> no effect on the issue.
>>>>>>
>>>>>> Any ideas on how to debug this further? What should I do to
>>>>>> identify this as a Ceph issue and not a networking or kernel
>>>>>> issue?
>>>>>
>>>>> This kind of issue has usually turned out to be a networking
>>>>> issue in the past (either kernel or hardware, or some combination
>>>>> of the two). I would suggest adding debug_ms=1, reproducing, and
>>>>> seeing whether the replicated op makes it to the blocked replica.
>>>>> It sounds like it isn't... in which case cranking it up to
>>>>> debug_ms=20 and reproducing will show you more about when Ceph is
>>>>> reading data off the socket and when it isn't. And while it is
>>>>> stuck you can identify the fd involved, check the socket status
>>>>> with netstat, see if the 'data waiting' flag is set or not, and
>>>>> so on.
>>>>>
>>>>> But the times we've gotten to that level it has (I think) always
>>>>> ended up being either jumbo frame issues with the network
>>>>> hardware or problems with, say, bonding. I'm not sure how the
>>>>> kernel version might have affected the hosts' interaction with
>>>>> the network, but it seems like it's possible...
>>>>>
>>>>> sage
>>>>
>>>> Sage,
>>>>
>>>> thanks for those suggestions. I'll try next week and get back. You
>>>> are right about jumbo frames and bonding (which I forgot to
>>>> mention).
>>>>
>>>> Just to make sure I understand correctly:
>>>>
>>>> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
>>>> - And the effect is that there will be debug output in the log
>>>>   files? And even more when set to 20?
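>>>> (My working assumption, to be verified: both ways should work, e.g.
>>>>
>>>>    # in ceph.conf, picked up after a daemon restart:
>>>>    [osd]
>>>>        debug ms = 1
>>>>
>>>>    # or injected into a running daemon via its admin socket:
>>>>    ceph daemon osd.9 config set debug_ms 20
>>>>
>>>> with the messenger output ending up in the usual log, e.g.
>>>> /var/log/ceph/ceph-osd.9.log, and 20 being far noisier than 1.)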
>>>> Have a nice weekend,
>>>>
>>>> Uwe
>>>>
>>>>>>
>>>>>> I can provide more specific info if needed.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Uwe
>>>>>>
>>>>>> #### Hardware details ####
>>>>>> Host type 1:
>>>>>>   CPU: 2x Intel Xeon E5-2670
>>>>>>   RAM: 64GiB
>>>>>>   Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
>>>>>>   Connected NICs: 1x 1GbE Intel (management access, MTU 1500),
>>>>>>                   1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>>
>>>>>> Host type 2:
>>>>>>   CPU: 2x Intel Xeon E5606
>>>>>>   RAM: 96GiB
>>>>>>   Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
>>>>>>   Connected NICs: 1x 1GbE Intel (management access, MTU 1500),
>>>>>>                   1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>> #### End Hardware ####
>>>>>>
>>>>>> #### Ceph OSD Tree ####
>>>>>> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
>>>>>>  -1       12.72653 root default
>>>>>>  -2        1.36418     host px-alpha-cluster
>>>>>>   0   hdd  0.22729         osd.0                     up  1.00000 1.00000
>>>>>>   1   hdd  0.22729         osd.1                     up  1.00000 1.00000
>>>>>>   2   hdd  0.90959         osd.2                     up  1.00000 1.00000
>>>>>>  -3        1.36418     host px-bravo-cluster
>>>>>>   3   hdd  0.22729         osd.3                     up  1.00000 1.00000
>>>>>>   4   hdd  0.22729         osd.4                     up  1.00000 1.00000
>>>>>>   5   hdd  0.90959         osd.5                     up  1.00000 1.00000
>>>>>>  -4        2.04648     host px-charlie-cluster
>>>>>>   6   hdd  0.90959         osd.6                     up  1.00000 1.00000
>>>>>>   7   hdd  0.22729         osd.7                     up  1.00000 1.00000
>>>>>>   8   hdd  0.90959         osd.8                     up  1.00000 1.00000
>>>>>>  -5        2.04648     host px-delta-cluster
>>>>>>   9   hdd  0.22729         osd.9                     up  1.00000 1.00000
>>>>>>  10   hdd  0.90959         osd.10                    up  1.00000 1.00000
>>>>>>  11   hdd  0.90959         osd.11                    up  1.00000 1.00000
>>>>>> -11        2.72516     host px-echo-cluster
>>>>>>  12   hdd  0.45419         osd.12                    up  1.00000 1.00000
>>>>>>  13   hdd  0.45419         osd.13                    up  1.00000 1.00000
>>>>>>  14   hdd  0.45419         osd.14                    up  1.00000 1.00000
>>>>>>  15   hdd  0.45419         osd.15                    up  1.00000 1.00000
>>>>>>  16   hdd  0.45419         osd.16                    up  1.00000 1.00000
>>>>>>  17   hdd  0.45419         osd.17                    up  1.00000 1.00000
>>>>>> -13        3.18005     host px-foxtrott-cluster
>>>>>>  18   hdd  0.45419         osd.18                    up  1.00000 1.00000
>>>>>>  19   hdd  0.45419         osd.19                    up  1.00000 1.00000
>>>>>>  20   hdd  0.45419         osd.20                    up  1.00000 1.00000
>>>>>>  21   hdd  0.90909         osd.21                    up  1.00000 1.00000
>>>>>>  22   hdd  0.45419         osd.22                    up  1.00000 1.00000
>>>>>>  23   hdd  0.45419         osd.23                    up  1.00000 1.00000
>>>>>> #### End OSD Tree ####
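>>>>>> (The map below was dumped with something like
>>>>>>
>>>>>>    ceph osd getcrushmap -o crush.bin    # fetch the binary CRUSH map
>>>>>>    crushtool -d crush.bin -o crush.txt  # decompile to the text form shown
>>>>>>
>>>>>> where crush.bin/crush.txt are arbitrary file names.)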
>>>>>>
>>>>>> #### CRUSH map ####
>>>>>> # begin crush map
>>>>>> tunable choose_local_tries 0
>>>>>> tunable choose_local_fallback_tries 0
>>>>>> tunable choose_total_tries 50
>>>>>> tunable chooseleaf_descend_once 1
>>>>>> tunable chooseleaf_vary_r 1
>>>>>> tunable chooseleaf_stable 1
>>>>>> tunable straw_calc_version 1
>>>>>> tunable allowed_bucket_algs 54
>>>>>>
>>>>>> # devices
>>>>>> device 0 osd.0 class hdd
>>>>>> device 1 osd.1 class hdd
>>>>>> device 2 osd.2 class hdd
>>>>>> device 3 osd.3 class hdd
>>>>>> device 4 osd.4 class hdd
>>>>>> device 5 osd.5 class hdd
>>>>>> device 6 osd.6 class hdd
>>>>>> device 7 osd.7 class hdd
>>>>>> device 8 osd.8 class hdd
>>>>>> device 9 osd.9 class hdd
>>>>>> device 10 osd.10 class hdd
>>>>>> device 11 osd.11 class hdd
>>>>>> device 12 osd.12 class hdd
>>>>>> device 13 osd.13 class hdd
>>>>>> device 14 osd.14 class hdd
>>>>>> device 15 osd.15 class hdd
>>>>>> device 16 osd.16 class hdd
>>>>>> device 17 osd.17 class hdd
>>>>>> device 18 osd.18 class hdd
>>>>>> device 19 osd.19 class hdd
>>>>>> device 20 osd.20 class hdd
>>>>>> device 21 osd.21 class hdd
>>>>>> device 22 osd.22 class hdd
>>>>>> device 23 osd.23 class hdd
>>>>>>
>>>>>> # types
>>>>>> type 0 osd
>>>>>> type 1 host
>>>>>> type 2 chassis
>>>>>> type 3 rack
>>>>>> type 4 row
>>>>>> type 5 pdu
>>>>>> type 6 pod
>>>>>> type 7 room
>>>>>> type 8 datacenter
>>>>>> type 9 region
>>>>>> type 10 root
>>>>>>
>>>>>> # buckets
>>>>>> host px-alpha-cluster {
>>>>>>     id -2              # do not change unnecessarily
>>>>>>     id -6 class hdd    # do not change unnecessarily
>>>>>>     # weight 1.364
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.0 weight 0.227
>>>>>>     item osd.1 weight 0.227
>>>>>>     item osd.2 weight 0.910
>>>>>> }
>>>>>> host px-bravo-cluster {
>>>>>>     id -3              # do not change unnecessarily
>>>>>>     id -7 class hdd    # do not change unnecessarily
>>>>>>     # weight 1.364
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.3 weight 0.227
>>>>>>     item osd.4 weight 0.227
>>>>>>     item osd.5 weight 0.910
>>>>>> }
>>>>>> host px-charlie-cluster {
>>>>>>     id -4              # do not change unnecessarily
>>>>>>     id -8 class hdd    # do not change unnecessarily
>>>>>>     # weight 2.046
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.7 weight 0.227
>>>>>>     item osd.8 weight 0.910
>>>>>>     item osd.6 weight 0.910
>>>>>> }
>>>>>> host px-delta-cluster {
>>>>>>     id -5              # do not change unnecessarily
>>>>>>     id -9 class hdd    # do not change unnecessarily
>>>>>>     # weight 2.046
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.9 weight 0.227
>>>>>>     item osd.10 weight 0.910
>>>>>>     item osd.11 weight 0.910
>>>>>> }
>>>>>> host px-echo-cluster {
>>>>>>     id -11             # do not change unnecessarily
>>>>>>     id -12 class hdd   # do not change unnecessarily
>>>>>>     # weight 2.725
>>>>>>     alg straw2
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.12 weight 0.454
>>>>>>     item osd.13 weight 0.454
>>>>>>     item osd.14 weight 0.454
>>>>>>     item osd.16 weight 0.454
>>>>>>     item osd.17 weight 0.454
>>>>>>     item osd.15 weight 0.454
>>>>>> }
>>>>>> host px-foxtrott-cluster {
>>>>>>     id -13             # do not change unnecessarily
>>>>>>     id -14 class hdd   # do not change unnecessarily
>>>>>>     # weight 3.180
>>>>>>     alg straw2
>>>>>>     hash 0             # rjenkins1
>>>>>>     item osd.18 weight 0.454
>>>>>>     item osd.19 weight 0.454
>>>>>>     item osd.20 weight 0.454
>>>>>>     item osd.22 weight 0.454
>>>>>>     item osd.23 weight 0.454
>>>>>>     item osd.21 weight 0.909
>>>>>> }
>>>>>> root default {
>>>>>>     id -1              # do not change unnecessarily
>>>>>>     id -10 class hdd   # do not change unnecessarily
>>>>>>     # weight 12.727
>>>>>>     alg straw
>>>>>>     hash 0             # rjenkins1
>>>>>>     item px-alpha-cluster weight 1.364
>>>>>>     item px-bravo-cluster weight 1.364
>>>>>>     item px-charlie-cluster weight 2.046
>>>>>>     item px-delta-cluster weight 2.046
>>>>>>     item px-echo-cluster weight 2.725
>>>>>>     item px-foxtrott-cluster weight 3.180
>>>>>> }
>>>>>>
>>>>>> # rules
>>>>>> rule replicated_ruleset {
>>>>>>     id 0
>>>>>>     type replicated
>>>>>>     min_size 1
>>>>>>     max_size 10
>>>>>>     step take default
>>>>>>     step chooseleaf firstn 0 type host
>>>>>>     step emit
>>>>>> }
>>>>>>
>>>>>> # end crush map
>>>>>> #### End CRUSH ####

--
This message was sent from my Android device with K-9 Mail.