Re: Help needed for diagnosing slow_requests

Hi Dongdong,

I think you're right--if the op thread is stuck then it's not a networking 
issue.  My next guess would be that there is an object that the backend is 
stuck processing, like an rgw index object with too many object entries.  
If you turn up debug filestore = 20 on the running daemon while it is 
stuck (ceph daemon osd.NNN config set debug_filestore 20) you might see 
output in the log indicating what it is working on.
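As a small convenience, the admin-socket call above can be wrapped in a tiny helper so the same line works for any OSD id and subsystem. This is just a sketch; `osd_debug_cmd` is a hypothetical name, not a ceph tool, and the printed command still has to be run on the node hosting that OSD:

```shell
# Hypothetical helper (not part of ceph): build the admin-socket command
# for a given OSD id, debug subsystem, and level, and print it so it can
# be reviewed before running it on the node that hosts that OSD.
osd_debug_cmd() {
    local osd_id=$1 subsystem=$2 level=$3
    echo "ceph daemon osd.${osd_id} config set debug_${subsystem} ${level}"
}

osd_debug_cmd 15 filestore 20   # raise filestore logging on osd.15
osd_debug_cmd 15 filestore 1/5  # later, drop it back to the default
```

Remember to drop the level back down afterwards; debug_filestore = 20 is extremely chatty and will grow the log quickly.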

sage



On Wed, 22 Aug 2018, 陶冬冬 wrote:

> Hey Sage, 
> 
> I just saw your comment that jumbo frames might cause this kind of slow request.
> We hit this kind of issue a few weeks ago, but in our case there were also OSD op thread timeouts, and some even reached the suicide timeout.
> I did trace the OSD log at debug level 20. From what I can see, the timed-out thread simply did not get scheduled during that period,
> which leaves the corresponding slow request stuck in the "queued_for_pg" event.
> What is very strange to me is why a misconfigured MTU on the network side would cause the OSD op thread to time out.
> 
> Thanks & Regards,
> Dongdong
> > On Aug 17, 2018, at 8:29 PM, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:
> > 
> > Am 17.08.18 um 14:23 schrieb Sage Weil:
> >> On Fri, 17 Aug 2018, Uwe Sauter wrote:
> >>> 
> >>> Dear devs,
> >>> 
> >>> I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…
> >>> 
> >>> 
> >>> 
> >>> TL;DR: The cluster runs well with kernel 4.13 but produces slow_requests with kernel 4.15. How do I debug this?
> >>> 
> >>> 
> >>> I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end).
> >>> The main differences between those hosts are the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks.
> >>> 
> >>> The cluster runs Proxmox 5.2, which is essentially Debian 9 but with Ubuntu kernels and the Proxmox
> >>> virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.
> >>> 
> >>> On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the other 3 only run OSDs.
> >>> OSD tree and CRUSH map are at the end. Ceph version is 12.2.7. All OSDs are BlueStore.
> >>> 
> >>> 
> >>> Now here's the thing:
> >>> 
> >>> Some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that
> >>> cause blocked IO inside the VMs that are running on the cluster (but not necessarily on the host
> >>> with the OSD causing the slow request).
> >>> 
> >>> If I boot back into 4.13 then Ceph runs smoothly again.
> >>> 
> >>> 
> >>> I'm seeking help to debug this issue as I'm running out of ideas what else I could do.
> >>> So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose, which always indicates that the
> >>> primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23")
> >>> but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is
> >>> no commit message from OSD 9).
> >>> 
> >>> On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD
> >>> does not commit, shouldn't there be an operation that does not succeed?
> >>> 
> >>> Restarting the (primary) OSD with the blocked operation clears the error, restarting the secondary OSD that
> >>> does not commit has no effect on the issue.
> >>> 
> >>> 
> >>> Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not
> >>> a networking or kernel issue?
> >> 
> >> This kind of issue has usually turned out to be a networking issue in the 
> >> past (either kernel or hardware, or some combination of the two).  I would 
> >> suggest adding debug_ms=1, reproducing, and seeing if the replicated op 
> >> makes it to the blocked replica.  It sounds like it isn't.. in which case 
> >> cranking it up to debug_ms=20 and reproducing will show you more about 
> >> when ceph is reading data off the socket and when it isn't.  And while it 
> >> is stuck you can identify the fd involved, check the socket status with 
> >> netstat, see if the 'data waiting' flag is set, and so on.
> >> 
> >> But the times when we've gotten to that level it has (I think) always ended up 
> >> being either jumbo frame issues with the network hardware or problems 
> >> with, say, bonding.  I'm not sure how the kernel version might have 
> >> affected the host's interaction with the network, but it seems 
> >> possible...
> >> 
> >> sage
> >> 
> > 
> > Sage,
> > 
> > thanks for those suggestions. I'll try next week and get back. You are right about jumbo frames and bonding (which I forgot to
> > mention).
> > 
> > Just to make sure I understand correctly:
> > 
> > - Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
> > - And the effect is that there will be debug output in the log files? And even more, when set to 20?
> > 
> > 
> > Have a nice weekend,
> > 
> > 	Uwe
> > 
> > 
> >> 
> >>> 
> >>> 
> >>> I can provide more specific info if needed.
> >>> 
> >>> 
> >>> Thanks,
> >>> 
> >>>  Uwe
> >>> 
> >>> 
> >>> 
> >>> #### Hardware details ####
> >>> Host type 1:
> >>>  CPU: 2x Intel Xeon E5-2670
> >>>  RAM: 64GiB
> >>>  Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB)
> >>>  connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> >>> 
> >>> Host type 2:
> >>>  CPU: 2x Intel Xeon E5606
> >>>  RAM: 96GiB
> >>>  Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB)
> >>>  connected NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)
> >>> #### End Hardware ####
> >>> 
> >>> #### Ceph OSD Tree ####
> >>> ID  CLASS WEIGHT   TYPE NAME                    STATUS REWEIGHT PRI-AFF
> >>> -1       12.72653 root default
> >>> -2        1.36418     host px-alpha-cluster
> >>>  0   hdd  0.22729         osd.0                    up  1.00000 1.00000
> >>>  1   hdd  0.22729         osd.1                    up  1.00000 1.00000
> >>>  2   hdd  0.90959         osd.2                    up  1.00000 1.00000
> >>> -3        1.36418     host px-bravo-cluster
> >>>  3   hdd  0.22729         osd.3                    up  1.00000 1.00000
> >>>  4   hdd  0.22729         osd.4                    up  1.00000 1.00000
> >>>  5   hdd  0.90959         osd.5                    up  1.00000 1.00000
> >>> -4        2.04648     host px-charlie-cluster
> >>>  6   hdd  0.90959         osd.6                    up  1.00000 1.00000
> >>>  7   hdd  0.22729         osd.7                    up  1.00000 1.00000
> >>>  8   hdd  0.90959         osd.8                    up  1.00000 1.00000
> >>> -5        2.04648     host px-delta-cluster
> >>>  9   hdd  0.22729         osd.9                    up  1.00000 1.00000
> >>> 10   hdd  0.90959         osd.10                   up  1.00000 1.00000
> >>> 11   hdd  0.90959         osd.11                   up  1.00000 1.00000
> >>> -11        2.72516     host px-echo-cluster
> >>> 12   hdd  0.45419         osd.12                   up  1.00000 1.00000
> >>> 13   hdd  0.45419         osd.13                   up  1.00000 1.00000
> >>> 14   hdd  0.45419         osd.14                   up  1.00000 1.00000
> >>> 15   hdd  0.45419         osd.15                   up  1.00000 1.00000
> >>> 16   hdd  0.45419         osd.16                   up  1.00000 1.00000
> >>> 17   hdd  0.45419         osd.17                   up  1.00000 1.00000
> >>> -13        3.18005     host px-foxtrott-cluster
> >>> 18   hdd  0.45419         osd.18                   up  1.00000 1.00000
> >>> 19   hdd  0.45419         osd.19                   up  1.00000 1.00000
> >>> 20   hdd  0.45419         osd.20                   up  1.00000 1.00000
> >>> 21   hdd  0.90909         osd.21                   up  1.00000 1.00000
> >>> 22   hdd  0.45419         osd.22                   up  1.00000 1.00000
> >>> 23   hdd  0.45419         osd.23                   up  1.00000 1.00000
> >>> #### End OSD Tree ####
> >>> 
> >>> #### CRUSH map ####
> >>> # begin crush map
> >>> tunable choose_local_tries 0
> >>> tunable choose_local_fallback_tries 0
> >>> tunable choose_total_tries 50
> >>> tunable chooseleaf_descend_once 1
> >>> tunable chooseleaf_vary_r 1
> >>> tunable chooseleaf_stable 1
> >>> tunable straw_calc_version 1
> >>> tunable allowed_bucket_algs 54
> >>> 
> >>> # devices
> >>> device 0 osd.0 class hdd
> >>> device 1 osd.1 class hdd
> >>> device 2 osd.2 class hdd
> >>> device 3 osd.3 class hdd
> >>> device 4 osd.4 class hdd
> >>> device 5 osd.5 class hdd
> >>> device 6 osd.6 class hdd
> >>> device 7 osd.7 class hdd
> >>> device 8 osd.8 class hdd
> >>> device 9 osd.9 class hdd
> >>> device 10 osd.10 class hdd
> >>> device 11 osd.11 class hdd
> >>> device 12 osd.12 class hdd
> >>> device 13 osd.13 class hdd
> >>> device 14 osd.14 class hdd
> >>> device 15 osd.15 class hdd
> >>> device 16 osd.16 class hdd
> >>> device 17 osd.17 class hdd
> >>> device 18 osd.18 class hdd
> >>> device 19 osd.19 class hdd
> >>> device 20 osd.20 class hdd
> >>> device 21 osd.21 class hdd
> >>> device 22 osd.22 class hdd
> >>> device 23 osd.23 class hdd
> >>> 
> >>> # types
> >>> type 0 osd
> >>> type 1 host
> >>> type 2 chassis
> >>> type 3 rack
> >>> type 4 row
> >>> type 5 pdu
> >>> type 6 pod
> >>> type 7 room
> >>> type 8 datacenter
> >>> type 9 region
> >>> type 10 root
> >>> 
> >>> # buckets
> >>> host px-alpha-cluster {
> >>>  id -2   # do not change unnecessarily
> >>>  id -6 class hdd   # do not change unnecessarily
> >>>  # weight 1.364
> >>>  alg straw
> >>>  hash 0  # rjenkins1
> >>>  item osd.0 weight 0.227
> >>>  item osd.1 weight 0.227
> >>>  item osd.2 weight 0.910
> >>> }
> >>> host px-bravo-cluster {
> >>>  id -3   # do not change unnecessarily
> >>>  id -7 class hdd   # do not change unnecessarily
> >>>  # weight 1.364
> >>>  alg straw
> >>>  hash 0  # rjenkins1
> >>>  item osd.3 weight 0.227
> >>>  item osd.4 weight 0.227
> >>>  item osd.5 weight 0.910
> >>> }
> >>> host px-charlie-cluster {
> >>>  id -4   # do not change unnecessarily
> >>>  id -8 class hdd   # do not change unnecessarily
> >>>  # weight 2.046
> >>>  alg straw
> >>>  hash 0  # rjenkins1
> >>>  item osd.7 weight 0.227
> >>>  item osd.8 weight 0.910
> >>>  item osd.6 weight 0.910
> >>> }
> >>> host px-delta-cluster {
> >>>  id -5   # do not change unnecessarily
> >>>  id -9 class hdd   # do not change unnecessarily
> >>>  # weight 2.046
> >>>  alg straw
> >>>  hash 0  # rjenkins1
> >>>  item osd.9 weight 0.227
> >>>  item osd.10 weight 0.910
> >>>  item osd.11 weight 0.910
> >>> }
> >>> host px-echo-cluster {
> >>>  id -11    # do not change unnecessarily
> >>>  id -12 class hdd    # do not change unnecessarily
> >>>  # weight 2.725
> >>>  alg straw2
> >>>  hash 0  # rjenkins1
> >>>  item osd.12 weight 0.454
> >>>  item osd.13 weight 0.454
> >>>  item osd.14 weight 0.454
> >>>  item osd.16 weight 0.454
> >>>  item osd.17 weight 0.454
> >>>  item osd.15 weight 0.454
> >>> }
> >>> host px-foxtrott-cluster {
> >>>  id -13    # do not change unnecessarily
> >>>  id -14 class hdd    # do not change unnecessarily
> >>>  # weight 3.180
> >>>  alg straw2
> >>>  hash 0  # rjenkins1
> >>>  item osd.18 weight 0.454
> >>>  item osd.19 weight 0.454
> >>>  item osd.20 weight 0.454
> >>>  item osd.22 weight 0.454
> >>>  item osd.23 weight 0.454
> >>>  item osd.21 weight 0.909
> >>> }
> >>> root default {
> >>>  id -1   # do not change unnecessarily
> >>>  id -10 class hdd    # do not change unnecessarily
> >>>  # weight 12.727
> >>>  alg straw
> >>>  hash 0  # rjenkins1
> >>>  item px-alpha-cluster weight 1.364
> >>>  item px-bravo-cluster weight 1.364
> >>>  item px-charlie-cluster weight 2.046
> >>>  item px-delta-cluster weight 2.046
> >>>  item px-echo-cluster weight 2.725
> >>>  item px-foxtrott-cluster weight 3.180
> >>> }
> >>> 
> >>> # rules
> >>> rule replicated_ruleset {
> >>>  id 0
> >>>  type replicated
> >>>  min_size 1
> >>>  max_size 10
> >>>  step take default
> >>>  step chooseleaf firstn 0 type host
> >>>  step emit
> >>> }
> >>> 
> >>> # end crush map
> >>> #### End CRUSH ####
> 
> 
