Hey Sage, I just saw your comment that a jumbo frame misconfiguration might cause this kind of slow request. We hit the same kind of issue a few weeks ago, but in our case there were also OSD op thread timeouts that even reached the suicide timeout. I traced the OSD log at debug level 20, and from what I can see the timed-out thread simply does not get to run during that window, which leaves the corresponding slow request stuck at the "queued_for_pg" event. It is very strange to me why a misconfigured MTU on the network side would cause the OSD op thread to time out?

Thanks & Regards,
Dongdong

On 17 Aug 2018, at 20:29, Uwe Sauter <uwe.sauter.de@xxxxxxxxx> wrote:

On 17.08.18 at 14:23, Sage Weil wrote:

On Fri, 17 Aug 2018, Uwe Sauter wrote:

Dear devs,

I'm posting on ceph-devel because I didn't get any feedback on ceph-users. This is an act of desperation…

TL;DR: The cluster runs fine with kernel 4.13 but produces slow requests with kernel 4.15. How do I debug this?

I'm running a combined Ceph / KVM cluster consisting of 6 hosts of 2 different kinds (details at the end). The main difference between those hosts is the CPU generation (Westmere / Sandy Bridge) and the number of OSD disks. The cluster runs Proxmox 5.2, which is essentially Debian 9 but using Ubuntu kernels and the Proxmox virtualization framework. The Proxmox WebUI also integrates some kind of Ceph management.

On the Ceph side, the cluster has 3 nodes that run MGR, MON and OSDs while the other 3 only run OSDs. OSD tree and CRUSH map are at the end. The Ceph version is 12.2.7. All OSDs are BlueStore.

Now here's the thing: some weeks ago Proxmox upgraded from kernel 4.13 to 4.15. Since then I'm getting slow requests that cause blocked IO inside the VMs running on the cluster (but not necessarily on the host with the OSD causing the slow request). If I boot back into 4.13, Ceph runs smoothly again.

I'm asking for help to debug this issue as I'm running out of ideas about what else I could do.

So far I have been using "ceph daemon osd.X dump_blocked_ops" to diagnose, which always indicates that the primary OSD scheduled copies on two secondaries (e.g. OSD 15: "event": "waiting for subops from 9,23") but only one of those succeeds ("event": "sub_op_commit_rec from 23"). The other one blocks (there is no commit message from OSD 9). On OSD 9 there is no blocked operation ("num_blocked_ops": 0), which confuses me a lot. If this OSD does not commit, there should be an operation that does not succeed, should it not?

Restarting the (primary) OSD with the blocked operation clears the error; restarting the secondary OSD that does not commit has no effect on the issue.

Any ideas on how to debug this further? What should I do to identify this as a Ceph issue and not a networking or kernel issue?

This kind of issue has usually turned out to be a networking issue in the past (either kernel or hardware, or some combination of the two).

I would suggest adding debug_ms=1, reproducing, and seeing whether the replicated op makes it to the blocked replica. It sounds like it isn't… in which case cranking it up to debug_ms=20 and reproducing will show you more about when Ceph is reading data off the socket and when it isn't. And while it is stuck you can identify the fd involved, check the socket status with netstat, see if the 'data waiting' flag is set or not, and so on. But the times when we've gotten to that level it has (I think) always ended up being either jumbo frame issues with the network hardware or problems with, say, bonding.
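Concretely, something along these lines; a rough sketch only, where osd.15 is taken from the dump_blocked_ops example above and the pid placeholder needs to be adapted to your ceph-osd process:

  # bump messenger logging on the primary with the blocked op at runtime
  # (or put "debug ms = 1" under [osd] in ceph.conf and restart); 1 is a
  # good start, 20 is very verbose
  ceph daemon osd.15 config set debug_ms 1

  # while an op is stuck: map the fd from the osd log to a socket ...
  ls -l /proc/<ceph-osd pid>/fd | grep socket

  # ... and check the TCP queues of that osd's connections; a non-zero
  # Recv-Q means data is sitting on the socket waiting to be read
  ss -tnp | grep <ceph-osd pid>
  netstat -tanp | grep ceph-osd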
I'm not sure how the kernel version might have affected the hosts' interaction with the network, but it seems like it's possible...

sage

Sage, thanks for those suggestions. I'll try next week and get back. You are right about jumbo frames and bonding (which I forgot to mention).

Just to make sure I understand correctly:
- Setting debug_ms=1 or debug_ms=20 is done in ceph.conf?
- And the effect is that there will be debug output in the log files? And even more, when set to 20?

Have a nice weekend,

	Uwe

I can provide more specific info if needed.

Thanks,

	Uwe

#### Hardware details ####

Host type 1:
  CPU: 2x Intel Xeon E5-2670
  RAM: 64GiB
  Storage: 1x SSD for OS, 3x HDD for Ceph (232GiB, some replaced by 931GiB) connected
  NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)

Host type 2:
  CPU: 2x Intel Xeon E5606
  RAM: 96GiB
  Storage: 1x HDD for OS, 5x HDD for Ceph (465GiB, some replaced by 931GiB) connected
  NIC: 1x 1GbE Intel (management access, MTU 1500), 1x 10GbE Myricom (Ceph & KVM, MTU 9000)

#### End Hardware ####

#### Ceph OSD Tree ####
 ID CLASS WEIGHT   TYPE NAME                     STATUS REWEIGHT PRI-AFF
 -1       12.72653 root default
 -2        1.36418     host px-alpha-cluster
  0   hdd  0.22729         osd.0                     up  1.00000 1.00000
  1   hdd  0.22729         osd.1                     up  1.00000 1.00000
  2   hdd  0.90959         osd.2                     up  1.00000 1.00000
 -3        1.36418     host px-bravo-cluster
  3   hdd  0.22729         osd.3                     up  1.00000 1.00000
  4   hdd  0.22729         osd.4                     up  1.00000 1.00000
  5   hdd  0.90959         osd.5                     up  1.00000 1.00000
 -4        2.04648     host px-charlie-cluster
  6   hdd  0.90959         osd.6                     up  1.00000 1.00000
  7   hdd  0.22729         osd.7                     up  1.00000 1.00000
  8   hdd  0.90959         osd.8                     up  1.00000 1.00000
 -5        2.04648     host px-delta-cluster
  9   hdd  0.22729         osd.9                     up  1.00000 1.00000
 10   hdd  0.90959         osd.10                    up  1.00000 1.00000
 11   hdd  0.90959         osd.11                    up  1.00000 1.00000
-11        2.72516     host px-echo-cluster
 12   hdd  0.45419         osd.12                    up  1.00000 1.00000
 13   hdd  0.45419         osd.13                    up  1.00000 1.00000
 14   hdd  0.45419         osd.14                    up  1.00000 1.00000
 15   hdd  0.45419         osd.15                    up  1.00000 1.00000
 16   hdd  0.45419         osd.16                    up  1.00000 1.00000
 17   hdd  0.45419         osd.17                    up  1.00000 1.00000
-13        3.18005     host px-foxtrott-cluster
 18   hdd  0.45419         osd.18                    up  1.00000 1.00000
 19   hdd  0.45419         osd.19                    up  1.00000 1.00000
 20   hdd  0.45419         osd.20                    up  1.00000 1.00000
 21   hdd  0.90909         osd.21                    up  1.00000 1.00000
 22   hdd  0.45419         osd.22                    up  1.00000 1.00000
 23   hdd  0.45419         osd.23                    up  1.00000 1.00000
#### End OSD Tree ####

#### CRUSH map ####
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0 class hdd
device 1 osd.1 class hdd
device 2 osd.2 class hdd
device 3 osd.3 class hdd
device 4 osd.4 class hdd
device 5 osd.5 class hdd
device 6 osd.6 class hdd
device 7 osd.7 class hdd
device 8 osd.8 class hdd
device 9 osd.9 class hdd
device 10 osd.10 class hdd
device 11 osd.11 class hdd
device 12 osd.12 class hdd
device 13 osd.13 class hdd
device 14 osd.14 class hdd
device 15 osd.15 class hdd
device 16 osd.16 class hdd
device 17 osd.17 class hdd
device 18 osd.18 class hdd
device 19 osd.19 class hdd
device 20 osd.20 class hdd
device 21 osd.21 class hdd
device 22 osd.22 class hdd
device 23 osd.23 class hdd

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host px-alpha-cluster {
    id -2           # do not change unnecessarily
    id -6 class hdd # do not change unnecessarily
    # weight 1.364
    alg straw
    hash 0  # rjenkins1
    item osd.0 weight 0.227
    item osd.1 weight 0.227
    item osd.2 weight 0.910
}
host px-bravo-cluster {
    id -3           # do not change unnecessarily
    id -7 class hdd # do not change unnecessarily
    # weight 1.364
    alg straw
    hash 0  # rjenkins1
    item osd.3 weight 0.227
    item osd.4 weight 0.227
    item osd.5 weight 0.910
}
host px-charlie-cluster {
    id -4           # do not change unnecessarily
    id -8 class hdd # do not change unnecessarily
    # weight 2.046
    alg straw
    hash 0  # rjenkins1
    item osd.7 weight 0.227
    item osd.8 weight 0.910
    item osd.6 weight 0.910
}
host px-delta-cluster {
    id -5           # do not change unnecessarily
    id -9 class hdd # do not change unnecessarily
    # weight 2.046
    alg straw
    hash 0  # rjenkins1
    item osd.9 weight 0.227
    item osd.10 weight 0.910
    item osd.11 weight 0.910
}
host px-echo-cluster {
    id -11           # do not change unnecessarily
    id -12 class hdd # do not change unnecessarily
    # weight 2.725
    alg straw2
    hash 0  # rjenkins1
    item osd.12 weight 0.454
    item osd.13 weight 0.454
    item osd.14 weight 0.454
    item osd.16 weight 0.454
    item osd.17 weight 0.454
    item osd.15 weight 0.454
}
host px-foxtrott-cluster {
    id -13           # do not change unnecessarily
    id -14 class hdd # do not change unnecessarily
    # weight 3.180
    alg straw2
    hash 0  # rjenkins1
    item osd.18 weight 0.454
    item osd.19 weight 0.454
    item osd.20 weight 0.454
    item osd.22 weight 0.454
    item osd.23 weight 0.454
    item osd.21 weight 0.909
}
root default {
    id -1            # do not change unnecessarily
    id -10 class hdd # do not change unnecessarily
    # weight 12.727
    alg straw
    hash 0  # rjenkins1
    item px-alpha-cluster weight 1.364
    item px-bravo-cluster weight 1.364
    item px-charlie-cluster weight 2.046
    item px-delta-cluster weight 2.046
    item px-echo-cluster weight 2.725
    item px-foxtrott-cluster weight 3.180
}

# rules
rule replicated_ruleset {
    id 0
    type replicated
    min_size 1
    max_size 10
    step take default
    step chooseleaf firstn 0 type host
    step emit
}

# end crush map
#### End CRUSH ####
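P.S.: To rule out a path MTU problem on the 10GbE links (MTU 9000 above), one quick check is a do-not-fragment ping at full jumbo size between two of the hosts; the interface name below is only an example and would need to match the Myricom NIC:

  # confirm the configured MTU on the 10GbE interface
  ip link show dev enp3s0 | grep mtu

  # 8972 bytes of ICMP payload + 28 bytes of IP/ICMP headers = a 9000-byte packet;
  # with -M do the packet may not be fragmented, so this fails or times out if
  # any hop on the path drops jumbo frames
  ping -M do -s 8972 -c 3 <other ceph host>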