Re: Help needed for diagnosing slow_requests

Sage, Dongdong,

I'll be out of office for the next two weeks and thus won't be able to follow the issue. As already mentioned, it is working fine with kernel 4.13, which is currently running again.

What I did so far is configure all hosts to use MTU 1500; the issue still occurs. I don't have access to the switch that four of the hosts connect to (a Cisco Nexus) and cannot disable LACP there. The Nexus won't allow non-LACP connections, so I was only able to unconfigure the bonding interface on the other two hosts, which connect to an HP switch.

Someone else suggested the Spectre mitigations as a cause. Do you have an idea how to prove or disprove this with Ceph itself (rather than just disabling the mitigations and checking whether the timeouts disappear from the logs, etc.)?
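
My rough plan for comparing the two kernels would be something along these
lines (untested, and the pool name is just a placeholder):

  # check which mitigations the running kernel applies (these sysfs
  # entries exist on 4.15; they may not be present on 4.13)
  grep . /sys/devices/system/cpu/vulnerabilities/*

  # compare OSD commit/apply latencies and a short synthetic write test
  # on each kernel
  ceph osd perf
  rados bench -p <test-pool> 30 write --no-cleanup
  rados -p <test-pool> cleanup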


Regards,

Uwe

On 25 August 2018 16:37:13 CEST, "陶冬冬" <tdd21151186@xxxxxxxxx> wrote:
>Hi Sage,
>
>It turned out to be a jumbo frame issue eventually. After we fixed the
>MTU issue there are no slow requests anymore, and no osd op thread
>timeouts/suicides anymore. That's the weird part to me: it's hard to
>see any connection between the MTU and the osd op thread or filestore
>thread.
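>
>For what it's worth, a quick way to confirm the path MTU end to end is
>something like the following (the peer address is a placeholder):
>
>  # 8972 = 9000 byte MTU minus 20 (IP header) and 8 (ICMP header);
>  # "-M do" sets the DF bit, so an undersized hop fails loudly
>  ping -M do -s 8972 -c 3 <osd-peer-ip>
>  ip link show | grep -i mtu    # check the MTU on the local interfaces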
>
>Thanks,
>Dongdong
>
>> On 25 Aug 2018, at 22:26, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>> 
>> Hi Dongdong,
>> 
>> I think you're right--if the op thread is stuck then it's not a
>> networking issue.  My next guess would be that there is an object that
>> the backend is stuck processing, like an rgw index object with too many
>> object entries.  If you turn up debug filestore = 20 on the running
>> daemon while it is stuck (ceph daemon osd.NNN config set debug_filestore
>> 20) you might see output in the log indicating what it is working on.
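>> 
>> For example, the sequence might look roughly like this (osd.NNN and the
>> log path are placeholders and may differ on your setup):
>> 
>>   ceph daemon osd.NNN config get debug_filestore   # note the current value
>>   ceph daemon osd.NNN config set debug_filestore 20
>>   tail -f /var/log/ceph/ceph-osd.NNN.log           # watch what the backend is working on
>>   ceph daemon osd.NNN config set debug_filestore 1/3  # or whatever "config get" reported before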
>> 
>> sage
>> 
>> 
>> 
>> On Wed, 22 Aug 2018, 陶冬冬 wrote:
>> 
>>> Hey Sage, 
>>> 
>>> I just saw your comments about how a jumbo frame problem might cause
>>> this kind of slow request. We hit this kind of issue a few weeks ago,
>>> but there were also osd op thread timeouts that even reached the
>>> suicide timeout.
>>> I did trace the osd log at severity 20. From what I can see, the
>>> timed-out thread simply does not get executed during that time, which
>>> leaves the corresponding slow request stuck in the event
>>> "queued_for_pg".
>>> It is very strange to me why a misconfigured MTU on the network side
>>> would cause the osd op thread to time out.
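>>> 
>>> For reference, the stuck state can be inspected through the admin
>>> socket with something like the following (the osd id is a placeholder):
>>> 
>>>   ceph daemon osd.NNN dump_ops_in_flight  # ops stuck at "queued_for_pg" show up here
>>>   ceph daemon osd.NNN dump_historic_ops   # recently completed slow ops
>>>   ceph daemon osd.NNN config set debug_osd 20  # op worker / heartbeat detail in the log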
>>> 
>>> Thanks & Regards,
>>> Dongdong
>>>> On 17 Aug 2018, at 20:29, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
>>>> 
>>>> On 17.08.18 14:23, Sage Weil wrote:
>>>>> On Fri, 17 Aug 2018, Uwe Sauter wrote:
>>>>>> 
>>>>>> Dear devs,
>>>>>> 
>>>>>> I'm posting on ceph-devel because I didn't get any feedback on
>>>>>> ceph-users. This is an act of desperation…
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> TL;DR: The cluster runs fine with kernel 4.13 but produces
>>>>>> slow_requests with kernel 4.15. How do I debug this?
>>>>>> 
>>>>>> 
>>>>>> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of
>>>>>> 2 different kinds (details at the end). The main differences between
>>>>>> those hosts are the CPU generation (Westmere / Sandy Bridge) and the
>>>>>> number of OSD disks.
>>>>>> 
>>>>>> The cluster runs Proxmox 5.2, which is essentially a Debian 9 but
>>>>>> using Ubuntu kernels and the Proxmox virtualization framework. The
>>>>>> Proxmox WebUI also integrates some kind of Ceph management.
>>>>>> 
>>>>>> On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs
>>>>>> while the other 3 only run OSDs. The OSD tree and CRUSH map are at
>>>>>> the end. The Ceph version is 12.2.7. All OSDs are BlueStore.
>>>>>> 
>>>>>> 
>>>>>> Now here's the thing:
>>>>>> 
>>>>>> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then
>>>>>> I'm getting slow requests that cause blocked IO inside the VMs that
>>>>>> are running on the cluster (but not necessarily on the host with the
>>>>>> OSD causing the slow request).
>>>>>> 
>>>>>> If I boot back into 4.13 then Ceph runs smoothly again.
>>>>>> 
>>>>>> 
>>>>>> I'm seeking help to debug this issue as I'm running out of ideas of
>>>>>> what else I could do. So far I have been using "ceph daemon osd.X
>>>>>> dump_blocked_ops" to diagnose it, which always indicates that the
>>>>>> primary OSD scheduled copies on two secondaries (e.g. OSD 15:
>>>>>> "event": "waiting for subops from 9,23") but only one of those
>>>>>> succeeds ("event": "sub_op_commit_rec from 23"). The other one
>>>>>> blocks (there is no commit message from OSD 9).
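>>>>>> 
>>>>>> For reference, the checks look roughly like this (OSD ids as in the
>>>>>> example above):
>>>>>> 
>>>>>>   ceph daemon osd.15 dump_blocked_ops    # primary: "waiting for subops from 9,23"
>>>>>>   ceph daemon osd.9  dump_blocked_ops    # replica: reports "num_blocked_ops": 0
>>>>>>   ceph daemon osd.9  dump_ops_in_flight  # the sub-op should show up here if osd.9 ever received it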
>>>>>> 
>>>>>> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which
>>>>>> confuses me a lot. If this OSD does not commit, there should be an
>>>>>> operation that does not succeed, should there not?
>>>>>> 
>>>>>> Restarting the (primary) OSD with the blocked operation clears the
>>>>>> error; restarting the secondary OSD that does not commit has no
>>>>>> effect on the issue.
>>>>>> 
>>>>>> 
>>>>>> Any ideas on how to debug this further? What should I do to identify
>>>>>> this as a Ceph issue and not a networking or kernel issue?
>>>>> 
>>>>> This kind of issue has usually turned out to be a networking issue in
>>>>> the past (either kernel or hardware, or some combination of the two).
>>>>> I would suggest adding debug_ms=1, reproducing, and seeing whether the
>>>>> replicated op makes it to the blocked replica.  It sounds like it
>>>>> isn't.. in which case cranking it up to debug_ms=20 and reproducing
>>>>> will show you more about when ceph is reading data off the socket and
>>>>> when it isn't.  And while it is stuck you can identify the fd
>>>>> involved, check the socket status with netstat, see if the 'data
>>>>> waiting flag' is set or not, and so on.
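>>>>> 
>>>>> Roughly, the socket check could look like this (the pid, inode/port,
>>>>> and peer address are placeholders):
>>>>> 
>>>>>   ls -l /proc/<osd-pid>/fd | grep socket  # map the fd from the debug_ms log to a socket inode
>>>>>   ss -tnpie | grep <inode-or-port>        # a non-zero Recv-Q means data is waiting unread
>>>>>   netstat -tnp | grep <peer-ip>           # same information via netstat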
>>>>> 
>>>>> But the times when we've gotten to that level it has (I think) always
>>>>> ended up being either jumbo frame issues with the network hardware or
>>>>> problems with, say, bonding.  I'm not sure how the kernel version
>>>>> might have affected the host's interaction with the network, but it
>>>>> seems possible...
>>>>> 
>>>>> sage
>>>>> 
>>>> 
>>>> Sage,
>>>> 
>>>> thanks for those suggestions. I'll try next week and get back. You are
>>>> right about jumbo frames and bonding (which I forgot to mention).
>>>> 
>>>> Just to make sure I understand correctly:
>>>> 
>>>> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf? (My guess at
>>>>   a runtime alternative is sketched below.)
>>>> - And the effect is that there will be debug output in the log files?
>>>>   And even more, when set to 20?
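>>>> 
>>>> My guess is that, instead of editing ceph.conf and restarting, I could
>>>> also set it at runtime, roughly like this (the osd id is a placeholder
>>>> -- please correct me if this is wrong):
>>>> 
>>>>   ceph daemon osd.NNN config set debug_ms 1    # per daemon, via the admin socket
>>>>   ceph tell osd.* injectargs '--debug_ms 1'    # or cluster-wide from a client node
>>>>   ceph daemon osd.NNN config set debug_ms 0/5  # back to the default afterwards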
>>>> 
>>>> 
>>>> Have a nice weekend,
>>>> 
>>>> 	Uwe
>>>> 
>>>> 
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> I can provide more specific info if needed.
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Uwe
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> #### Hardware details ####
>>>>>> Host type 1:
>>>>>> CPU: 2x Intel Xeon E5-2670
>>>>>> RAM: 64 GiB
>>>>>> Storage: 1x SSD for OS, 3x HDD for Ceph (232 GiB, some replaced by 931 GiB)
>>>>>> Connected NICs: 1x 1 GbE Intel (management access, MTU 1500),
>>>>>>                 1x 10 GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>> 
>>>>>> Host type 2:
>>>>>> CPU: 2x Intel Xeon E5606
>>>>>> RAM: 96 GiB
>>>>>> Storage: 1x HDD for OS, 5x HDD for Ceph (465 GiB, some replaced by 931 GiB)
>>>>>> Connected NICs: 1x 1 GbE Intel (management access, MTU 1500),
>>>>>>                 1x 10 GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>> #### End Hardware ####
>>>>>> 
>>>>>> #### Ceph OSD Tree ####
>>>>>> ID  CLASS WEIGHT   TYPE NAME                    STATUS REWEIGHT PRI-AFF
>>>>>>  -1       12.72653 root default
>>>>>>  -2        1.36418     host px-alpha-cluster
>>>>>>   0   hdd  0.22729         osd.0                    up  1.00000 1.00000
>>>>>>   1   hdd  0.22729         osd.1                    up  1.00000 1.00000
>>>>>>   2   hdd  0.90959         osd.2                    up  1.00000 1.00000
>>>>>>  -3        1.36418     host px-bravo-cluster
>>>>>>   3   hdd  0.22729         osd.3                    up  1.00000 1.00000
>>>>>>   4   hdd  0.22729         osd.4                    up  1.00000 1.00000
>>>>>>   5   hdd  0.90959         osd.5                    up  1.00000 1.00000
>>>>>>  -4        2.04648     host px-charlie-cluster
>>>>>>   6   hdd  0.90959         osd.6                    up  1.00000 1.00000
>>>>>>   7   hdd  0.22729         osd.7                    up  1.00000 1.00000
>>>>>>   8   hdd  0.90959         osd.8                    up  1.00000 1.00000
>>>>>>  -5        2.04648     host px-delta-cluster
>>>>>>   9   hdd  0.22729         osd.9                    up  1.00000 1.00000
>>>>>>  10   hdd  0.90959         osd.10                   up  1.00000 1.00000
>>>>>>  11   hdd  0.90959         osd.11                   up  1.00000 1.00000
>>>>>> -11        2.72516     host px-echo-cluster
>>>>>>  12   hdd  0.45419         osd.12                   up  1.00000 1.00000
>>>>>>  13   hdd  0.45419         osd.13                   up  1.00000 1.00000
>>>>>>  14   hdd  0.45419         osd.14                   up  1.00000 1.00000
>>>>>>  15   hdd  0.45419         osd.15                   up  1.00000 1.00000
>>>>>>  16   hdd  0.45419         osd.16                   up  1.00000 1.00000
>>>>>>  17   hdd  0.45419         osd.17                   up  1.00000 1.00000
>>>>>> -13        3.18005     host px-foxtrott-cluster
>>>>>>  18   hdd  0.45419         osd.18                   up  1.00000 1.00000
>>>>>>  19   hdd  0.45419         osd.19                   up  1.00000 1.00000
>>>>>>  20   hdd  0.45419         osd.20                   up  1.00000 1.00000
>>>>>>  21   hdd  0.90909         osd.21                   up  1.00000 1.00000
>>>>>>  22   hdd  0.45419         osd.22                   up  1.00000 1.00000
>>>>>>  23   hdd  0.45419         osd.23                   up  1.00000 1.00000
>>>>>> #### End OSD Tree ####
>>>>>> 
>>>>>> #### CRUSH map ####
>>>>>> # begin crush map
>>>>>> tunable choose_local_tries 0
>>>>>> tunable choose_local_fallback_tries 0
>>>>>> tunable choose_total_tries 50
>>>>>> tunable chooseleaf_descend_once 1
>>>>>> tunable chooseleaf_vary_r 1
>>>>>> tunable chooseleaf_stable 1
>>>>>> tunable straw_calc_version 1
>>>>>> tunable allowed_bucket_algs 54
>>>>>> 
>>>>>> # devices
>>>>>> device 0 osd.0 class hdd
>>>>>> device 1 osd.1 class hdd
>>>>>> device 2 osd.2 class hdd
>>>>>> device 3 osd.3 class hdd
>>>>>> device 4 osd.4 class hdd
>>>>>> device 5 osd.5 class hdd
>>>>>> device 6 osd.6 class hdd
>>>>>> device 7 osd.7 class hdd
>>>>>> device 8 osd.8 class hdd
>>>>>> device 9 osd.9 class hdd
>>>>>> device 10 osd.10 class hdd
>>>>>> device 11 osd.11 class hdd
>>>>>> device 12 osd.12 class hdd
>>>>>> device 13 osd.13 class hdd
>>>>>> device 14 osd.14 class hdd
>>>>>> device 15 osd.15 class hdd
>>>>>> device 16 osd.16 class hdd
>>>>>> device 17 osd.17 class hdd
>>>>>> device 18 osd.18 class hdd
>>>>>> device 19 osd.19 class hdd
>>>>>> device 20 osd.20 class hdd
>>>>>> device 21 osd.21 class hdd
>>>>>> device 22 osd.22 class hdd
>>>>>> device 23 osd.23 class hdd
>>>>>> 
>>>>>> # types
>>>>>> type 0 osd
>>>>>> type 1 host
>>>>>> type 2 chassis
>>>>>> type 3 rack
>>>>>> type 4 row
>>>>>> type 5 pdu
>>>>>> type 6 pod
>>>>>> type 7 room
>>>>>> type 8 datacenter
>>>>>> type 9 region
>>>>>> type 10 root
>>>>>> 
>>>>>> # buckets
>>>>>> host px-alpha-cluster {
>>>>>> id -2   # do not change unnecessarily
>>>>>> id -6 class hdd   # do not change unnecessarily
>>>>>> # weight 1.364
>>>>>> alg straw
>>>>>> hash 0  # rjenkins1
>>>>>> item osd.0 weight 0.227
>>>>>> item osd.1 weight 0.227
>>>>>> item osd.2 weight 0.910
>>>>>> }
>>>>>> host px-bravo-cluster {
>>>>>> id -3   # do not change unnecessarily
>>>>>> id -7 class hdd   # do not change unnecessarily
>>>>>> # weight 1.364
>>>>>> alg straw
>>>>>> hash 0  # rjenkins1
>>>>>> item osd.3 weight 0.227
>>>>>> item osd.4 weight 0.227
>>>>>> item osd.5 weight 0.910
>>>>>> }
>>>>>> host px-charlie-cluster {
>>>>>> id -4   # do not change unnecessarily
>>>>>> id -8 class hdd   # do not change unnecessarily
>>>>>> # weight 2.046
>>>>>> alg straw
>>>>>> hash 0  # rjenkins1
>>>>>> item osd.7 weight 0.227
>>>>>> item osd.8 weight 0.910
>>>>>> item osd.6 weight 0.910
>>>>>> }
>>>>>> host px-delta-cluster {
>>>>>> id -5   # do not change unnecessarily
>>>>>> id -9 class hdd   # do not change unnecessarily
>>>>>> # weight 2.046
>>>>>> alg straw
>>>>>> hash 0  # rjenkins1
>>>>>> item osd.9 weight 0.227
>>>>>> item osd.10 weight 0.910
>>>>>> item osd.11 weight 0.910
>>>>>> }
>>>>>> host px-echo-cluster {
>>>>>> id -11    # do not change unnecessarily
>>>>>> id -12 class hdd    # do not change unnecessarily
>>>>>> # weight 2.725
>>>>>> alg straw2
>>>>>> hash 0  # rjenkins1
>>>>>> item osd.12 weight 0.454
>>>>>> item osd.13 weight 0.454
>>>>>> item osd.14 weight 0.454
>>>>>> item osd.16 weight 0.454
>>>>>> item osd.17 weight 0.454
>>>>>> item osd.15 weight 0.454
>>>>>> }
>>>>>> host px-foxtrott-cluster {
>>>>>> id -13    # do not change unnecessarily
>>>>>> id -14 class hdd    # do not change unnecessarily
>>>>>> # weight 3.180
>>>>>> alg straw2
>>>>>> hash 0  # rjenkins1
>>>>>> item osd.18 weight 0.454
>>>>>> item osd.19 weight 0.454
>>>>>> item osd.20 weight 0.454
>>>>>> item osd.22 weight 0.454
>>>>>> item osd.23 weight 0.454
>>>>>> item osd.21 weight 0.909
>>>>>> }
>>>>>> root default {
>>>>>> id -1   # do not change unnecessarily
>>>>>> id -10 class hdd    # do not change unnecessarily
>>>>>> # weight 12.727
>>>>>> alg straw
>>>>>> hash 0  # rjenkins1
>>>>>> item px-alpha-cluster weight 1.364
>>>>>> item px-bravo-cluster weight 1.364
>>>>>> item px-charlie-cluster weight 2.046
>>>>>> item px-delta-cluster weight 2.046
>>>>>> item px-echo-cluster weight 2.725
>>>>>> item px-foxtrott-cluster weight 3.180
>>>>>> }
>>>>>> 
>>>>>> # rules
>>>>>> rule replicated_ruleset {
>>>>>> id 0
>>>>>> type replicated
>>>>>> min_size 1
>>>>>> max_size 10
>>>>>> step take default
>>>>>> step chooseleaf firstn 0 type host
>>>>>> step emit
>>>>>> }
>>>>>> 
>>>>>> # end crush map
>>>>>> #### End CRUSH ####
>>> 

-- 
This message was sent from my Android device with K-9 Mail.



