Hi Sage,

It turned out to be a Jumbo frame issue after all. After we fixed the MTU there are no more slow requests
and no more osd op thread timeouts/suicides. That is the weird part to me: it is hard to see any connection
between the MTU and the osd op thread or filestore thread.

Thanks,
Dongdong

> On 25 Aug 2018, at 22:26, Sage Weil <sage@xxxxxxxxxxxx> wrote:
>
> Hi Dongdong,
>
> I think you're right--if the op thread is stuck then it's not a networking
> issue. My next guess would be that there is an object that the backend is
> stuck processing, like an rgw index object with too many object entries.
> If you turn up debug filestore = 20 on the running daemon while it is
> stuck (ceph daemon osd.NNN config set debug_filestore 20) you might see
> output in the log indicating what it is working on.
>
> sage
>
>
> On Wed, 22 Aug 2018, 陶冬冬 wrote:
>
>> Hey Sage,
>>
>> I just saw your comment that Jumbo frames might cause this kind of slow request.
>> We hit this kind of issue a few weeks ago, but there were also osd op thread
>> timeouts, some even reaching the suicide timeout. I traced the osd log at
>> severity 20, and from what I can see the timed-out thread simply did not get
>> scheduled during that period, which left the corresponding slow request stuck
>> in the "queued_for_pg" event. It is very strange to me why a misconfigured MTU
>> on the network side would cause the osd op thread to time out.
>>
>> Thanks & Regards,
>> Dongdong
>>
>>> On 17 Aug 2018, at 20:29, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
>>>
>>> On 17.08.18 at 14:23, Sage Weil wrote:
>>>> On Fri, 17 Aug 2018, Uwe Sauter wrote:
>>>>>
>>>>> Dear devs,
>>>>>
>>>>> I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…
>>>>>
>>>>> TL;DR: the cluster runs fine with kernel 4.13 but produces slow_requests with kernel 4.15. How do I debug this?
>>>>>
>>>>> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
>>>>> The main difference between those hosts is the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks.
>>>>>
>>>>> The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu kernels and the Proxmox
>>>>> virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.
>>>>>
>>>>> On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the other 3 only run OSDs.
>>>>> The OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7. All OSDs are BlueStore.
>>>>>
>>>>> Now here's the thing:
>>>>>
>>>>> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that
>>>>> cause blocked IO inside the VMs running on the cluster (but not necessarily on the host
>>>>> with the OSD causing the slow request).
>>>>>
>>>>> If I boot back into 4.13, Ceph runs smoothly again.
>>>>>
>>>>> I'm seeking help to debug this issue as I'm running out of ideas what else I could do.
>>>>> So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose it, which always indicates that the
>>>>> primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23")
>>>>> but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
>>>>> no commit message from OSD 9).
>>>>>
>>>>> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD
>>>>> does not commit, there should be an operation that does not succeed, should there not?
>>>>>
>>>>> Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that
>>>>> does not commit has no effect on the issue.
>>>>>
>>>>> Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not
>>>>> a networking or kernel issue?
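A sketch of how the "dump_blocked_ops" check described above could be repeated for every OSD on
one host. It assumes the default admin-socket location under /var/run/ceph/; adjust the glob if
the sockets live elsewhere.

    # Print the blocked-op count for every OSD with an admin socket on this host.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        id=${sock##*/ceph-osd.}; id=${id%.asok}      # extract the OSD id from the socket name
        printf 'osd.%s: ' "$id"
        ceph daemon "osd.$id" dump_blocked_ops | grep -m1 '"num_blocked_ops"'
    done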
>>>>
>>>> This kind of issue has usually turned out to be a networking issue in the
>>>> past (either kernel or hardware, or some combination of the two). I would
>>>> suggest adding debug_ms=1, reproducing, and seeing whether the replicated op
>>>> makes it to the blocked replica. It sounds like it isn't... in which case
>>>> cranking it up to debug_ms=20 and reproducing will show you more about
>>>> when Ceph is reading data off the socket and when it isn't. And while it
>>>> is stuck you can identify the fd involved, check the socket status with
>>>> netstat, see whether the 'data waiting' flag is set, and so on.
>>>>
>>>> But the times we've gotten to that level it has (I think) always ended up
>>>> being either jumbo frame issues with the network hardware or problems
>>>> with, say, bonding. I'm not sure how the kernel version might affect the
>>>> host's interaction with the network, but it seems possible...
>>>>
>>>> sage
>>>>
>>>
>>> Sage,
>>>
>>> thanks for those suggestions. I'll try next week and get back to you. You are right about jumbo frames and bonding
>>> (which I forgot to mention).
>>>
>>> Just to make sure I understand correctly:
>>>
>>> - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
>>> - And the effect is that there will be debug output in the log files? And even more when set to 20?
>>>
>>> Have a nice weekend,
>>>
>>> Uwe
>>>
>>>>
>>>>>
>>>>> I can provide more specific info if needed.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Uwe
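On Uwe's question above: debug_ms can be set in the [osd] section of ceph.conf and takes effect
after a restart, but it can also be raised on a running OSD without restarting it, and the higher
the level the more messenger output ends up in the OSD log. A minimal sketch of that workflow,
together with the socket check Sage mentions (osd.15 is just the example id from this thread):

    # Raise messenger debugging on a running OSD; remember to lower it again
    # afterwards, the logs grow very quickly at level 20.
    ceph daemon osd.15 config set debug_ms 1        # via the local admin socket
    ceph tell osd.15 injectargs '--debug_ms 1'      # same effect, via the monitors

    # While a subop is stuck, look at the OSD's TCP sockets: a Recv-Q that stays
    # non-zero means data is queued that the daemon never reads, while a missing
    # connection to the peer OSD points at the network or bonding instead.
    pid=$(pgrep -f 'ceph-osd.*--id 15')             # the match pattern is an example
    netstat -tnp 2>/dev/null | grep "$pid/ceph-osd"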
>>>>>
>>>>> #### Hardware details ####
>>>>> Host type 1:
>>>>> CPU: 2x Intel Xeon E5-2670
>>>>> RAM: 64GiB
>>>>> Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
>>>>> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>>
>>>>> Host type 2:
>>>>> CPU: 2x Intel Xeon E5606
>>>>> RAM: 96GiB
>>>>> Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
>>>>> connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
>>>>> #### End Hardware ####
>>>>>
>>>>> #### Ceph OSD Tree ####
>>>>> ID  CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
>>>>>  -1       12.72653 root default
>>>>>  -2        1.36418     host px-alpha-cluster
>>>>>   0   hdd  0.22729         osd.0                     up  1.00000 1.00000
>>>>>   1   hdd  0.22729         osd.1                     up  1.00000 1.00000
>>>>>   2   hdd  0.90959         osd.2                     up  1.00000 1.00000
>>>>>  -3        1.36418     host px-bravo-cluster
>>>>>   3   hdd  0.22729         osd.3                     up  1.00000 1.00000
>>>>>   4   hdd  0.22729         osd.4                     up  1.00000 1.00000
>>>>>   5   hdd  0.90959         osd.5                     up  1.00000 1.00000
>>>>>  -4        2.04648     host px-charlie-cluster
>>>>>   6   hdd  0.90959         osd.6                     up  1.00000 1.00000
>>>>>   7   hdd  0.22729         osd.7                     up  1.00000 1.00000
>>>>>   8   hdd  0.90959         osd.8                     up  1.00000 1.00000
>>>>>  -5        2.04648     host px-delta-cluster
>>>>>   9   hdd  0.22729         osd.9                     up  1.00000 1.00000
>>>>>  10   hdd  0.90959         osd.10                    up  1.00000 1.00000
>>>>>  11   hdd  0.90959         osd.11                    up  1.00000 1.00000
>>>>> -11        2.72516     host px-echo-cluster
>>>>>  12   hdd  0.45419         osd.12                    up  1.00000 1.00000
>>>>>  13   hdd  0.45419         osd.13                    up  1.00000 1.00000
>>>>>  14   hdd  0.45419         osd.14                    up  1.00000 1.00000
>>>>>  15   hdd  0.45419         osd.15                    up  1.00000 1.00000
>>>>>  16   hdd  0.45419         osd.16                    up  1.00000 1.00000
>>>>>  17   hdd  0.45419         osd.17                    up  1.00000 1.00000
>>>>> -13        3.18005     host px-foxtrott-cluster
>>>>>  18   hdd  0.45419         osd.18                    up  1.00000 1.00000
>>>>>  19   hdd  0.45419         osd.19                    up  1.00000 1.00000
>>>>>  20   hdd  0.45419         osd.20                    up  1.00000 1.00000
>>>>>  21   hdd  0.90909         osd.21                    up  1.00000 1.00000
>>>>>  22   hdd  0.45419         osd.22                    up  1.00000 1.00000
>>>>>  23   hdd  0.45419         osd.23                    up  1.00000 1.00000
>>>>> #### End OSD Tree ####
>>>>>
>>>>> #### CRUSH map ####
>>>>> # begin crush map
>>>>> tunable choose_local_tries 0
>>>>> tunable choose_local_fallback_tries 0
>>>>> tunable choose_total_tries 50
>>>>> tunable chooseleaf_descend_once 1
>>>>> tunable chooseleaf_vary_r 1
>>>>> tunable chooseleaf_stable 1
>>>>> tunable straw_calc_version 1
>>>>> tunable allowed_bucket_algs 54
>>>>>
>>>>> # devices
>>>>> device 0 osd.0 class hdd
>>>>> device 1 osd.1 class hdd
>>>>> device 2 osd.2 class hdd
>>>>> device 3 osd.3 class hdd
>>>>> device 4 osd.4 class hdd
>>>>> device 5 osd.5 class hdd
>>>>> device 6 osd.6 class hdd
>>>>> device 7 osd.7 class hdd
>>>>> device 8 osd.8 class hdd
>>>>> device 9 osd.9 class hdd
>>>>> device 10 osd.10 class hdd
>>>>> device 11 osd.11 class hdd
>>>>> device 12 osd.12 class hdd
>>>>> device 13 osd.13 class hdd
>>>>> device 14 osd.14 class hdd
>>>>> device 15 osd.15 class hdd
>>>>> device 16 osd.16 class hdd
>>>>> device 17 osd.17 class hdd
>>>>> device 18 osd.18 class hdd
>>>>> device 19 osd.19 class hdd
>>>>> device 20 osd.20 class hdd
>>>>> device 21 osd.21 class hdd
>>>>> device 22 osd.22 class hdd
>>>>> device 23 osd.23 class hdd
>>>>>
>>>>> # types
>>>>> type 0 osd
>>>>> type 1 host
>>>>> type 2 chassis
>>>>> type 3 rack
>>>>> type 4 row
>>>>> type 5 pdu
>>>>> type 6 pod
>>>>> type 7 room
>>>>> type 8 datacenter
>>>>> type 9 region
>>>>> type 10 root
>>>>>
>>>>> # buckets
>>>>> host px-alpha-cluster {
>>>>>         id -2                   # do not change unnecessarily
>>>>>         id -6 class hdd         # do not change unnecessarily
>>>>>         # weight 1.364
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.0 weight 0.227
>>>>>         item osd.1 weight 0.227
>>>>>         item osd.2 weight 0.910
>>>>> }
>>>>> host px-bravo-cluster {
>>>>>         id -3                   # do not change unnecessarily
>>>>>         id -7 class hdd         # do not change unnecessarily
>>>>>         # weight 1.364
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.3 weight 0.227
>>>>>         item osd.4 weight 0.227
>>>>>         item osd.5 weight 0.910
>>>>> }
>>>>> host px-charlie-cluster {
>>>>>         id -4                   # do not change unnecessarily
>>>>>         id -8 class hdd         # do not change unnecessarily
>>>>>         # weight 2.046
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.7 weight 0.227
>>>>>         item osd.8 weight 0.910
>>>>>         item osd.6 weight 0.910
>>>>> }
>>>>> host px-delta-cluster {
>>>>>         id -5                   # do not change unnecessarily
>>>>>         id -9 class hdd         # do not change unnecessarily
>>>>>         # weight 2.046
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.9 weight 0.227
>>>>>         item osd.10 weight 0.910
>>>>>         item osd.11 weight 0.910
>>>>> }
>>>>> host px-echo-cluster {
>>>>>         id -11                  # do not change unnecessarily
>>>>>         id -12 class hdd        # do not change unnecessarily
>>>>>         # weight 2.725
>>>>>         alg straw2
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.12 weight 0.454
>>>>>         item osd.13 weight 0.454
>>>>>         item osd.14 weight 0.454
>>>>>         item osd.16 weight 0.454
>>>>>         item osd.17 weight 0.454
>>>>>         item osd.15 weight 0.454
>>>>> }
>>>>> host px-foxtrott-cluster {
>>>>>         id -13                  # do not change unnecessarily
>>>>>         id -14 class hdd        # do not change unnecessarily
>>>>>         # weight 3.180
>>>>>         alg straw2
>>>>>         hash 0  # rjenkins1
>>>>>         item osd.18 weight 0.454
>>>>>         item osd.19 weight 0.454
>>>>>         item osd.20 weight 0.454
>>>>>         item osd.22 weight 0.454
>>>>>         item osd.23 weight 0.454
>>>>>         item osd.21 weight 0.909
>>>>> }
>>>>> root default {
>>>>>         id -1                   # do not change unnecessarily
>>>>>         id -10 class hdd        # do not change unnecessarily
>>>>>         # weight 12.727
>>>>>         alg straw
>>>>>         hash 0  # rjenkins1
>>>>>         item px-alpha-cluster weight 1.364
>>>>>         item px-bravo-cluster weight 1.364
>>>>>         item px-charlie-cluster weight 2.046
>>>>>         item px-delta-cluster weight 2.046
>>>>>         item px-echo-cluster weight 2.725
>>>>>         item px-foxtrott-cluster weight 3.180
>>>>> }
>>>>>
>>>>> # rules
>>>>> rule replicated_ruleset {
>>>>>         id 0
>>>>>         type replicated
>>>>>         min_size 1
>>>>>         max_size 10
>>>>>         step take default
>>>>>         step chooseleaf firstn 0 type host
>>>>>         step emit
>>>>> }
>>>>>
>>>>> # end crush map
>>>>> #### End CRUSH ####
>>
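Relating to the jumbo-frame resolution reported at the top of the thread, a minimal sketch for
verifying that a 9000-byte MTU actually survives the path between two OSD hosts; the interface
name is a placeholder and the target hostname is taken from the OSD tree above.

    # Confirm that the Ceph-facing interface really carries MTU 9000 (interface name is a placeholder).
    ip link show dev eth1 | grep -o 'mtu [0-9]*'

    # Send a full-size jumbo frame with the don't-fragment bit set:
    # 8972 = 9000 byte MTU - 20 byte IP header - 8 byte ICMP header.
    # If this fails while a plain ping works, some hop (switch port, bond, VLAN)
    # is dropping or fragmenting jumbo frames.
    ping -M do -s 8972 -c 3 px-bravo-cluster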