Hi,
We have a cluster with 9 hosts and 540 HDDs using
Bluestore and containerized OSDs running luminous 12.2.4.
While trying to add new nodes, the cluster collapsed because it
could not keep up with establishing enough TCP connections. We
tuned sysctl so the hosts can handle more connections and
recycle TIME_WAIT sockets faster. Currently, as we try to
restart the cluster by bringing up a few OSDs at a time, some
of the OSDs get very busy after around 360 of them have come up.
iostat shows that the busy OSDs are constantly reading from
their Bluestore partitions. The number of busy OSDs per node
varies, norecover is set, and there are no active clients. The
OSD logs show nothing other than cephx verify_authorizer
errors, which appear on both busy and idle OSDs and do not seem
to be related to the drive reads.
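(For reference, this is roughly how we are correlating the busy
drives with OSD IDs; osd.86 below is just one example of a busy
OSD, and the exact device names differ per node:)

  iostat -x 5 3                                          # r/s and %util stay pegged on the busy OSDs' data disks
  ceph osd metadata 86 | grep -E '"hostname"|"devices"'  # map an OSD id back to its host and block device(s)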
How can we figure out why the OSDs are busy reading from the
drives? If it is some kind of recovery, is there a way to track
its progress? The output of ceph -s and logs from a busy and an
idle OSD are copied below.
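(If it helps, the OSD admin sockets are reachable inside the
containers, so we can pull things like the following from one of
the busy OSDs; again, osd.86 is just an example:)

  ceph daemon osd.86 status             # osdmap epochs and number of PGs on this OSD
  ceph daemon osd.86 ops                # ops currently in flight
  ceph daemon osd.86 dump_historic_ops  # recently completed slow ops
  ceph daemon osd.86 perf dump          # bluestore and recovery perf counters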
Thanks
Chandra
Uptime stats with load averages show variance across the 9 older nodes:
02:43:44 up 19:21, 0 users, load average: 0.88, 1.03, 1.06
02:43:44 up 7:58, 0 users, load average: 16.91, 13.49, 12.43
02:43:44 up 1 day, 14 min, 0 users, load average: 7.67, 6.70, 6.35
02:43:45 up 7:01, 0 users, load average: 84.40, 84.20, 83.73
02:43:45 up 6:40, 1 user, load average: 17.08, 17.40, 20.05
02:43:45 up 19:46, 0 users, load average: 15.58, 11.93, 11.44
02:43:45 up 20:39, 0 users, load average: 7.88, 6.50, 5.69
02:43:46 up 1 day, 1:20, 0 users, load average: 5.03, 3.81, 3.49
02:43:46 up 1 day, 58 min, 0 users, load average: 0.62, 1.00, 1.38
Ceph config
-----------
[global]
cluster network = 192.168.13.0/24
fsid = <>
mon host = 172.16.13.101,172.16.13.102,172.16.13.103
mon initial members = ctrl1,ctrl2,ctrl3
mon_max_pg_per_osd = 750
mon_osd_backfillfull_ratio = 0.92
mon_osd_down_out_interval = 900
mon_osd_full_ratio = 0.95
mon_osd_nearfull_ratio = 0.85
osd_crush_chooseleaf_type = 3
osd_heartbeat_grace = 900
mon_osd_laggy_max_interval = 900
osd_max_pg_per_osd_hard_ratio = 1.0
public network = 172.16.13.0/24

[mon]
mon_compact_on_start = true

[osd]
osd_deep_scrub_interval = 2419200
osd_deep_scrub_stride = 4194304
osd_max_backfills = 10
osd_max_object_size = 276824064
osd_max_scrubs = 1
osd_max_write_size = 264
osd_pool_erasure_code_stripe_unit = 2097152
osd_recovery_max_active = 10
osd_heartbeat_interval = 15
Data nodes sysctl params
------------------------
fs.aio-max-nr=1048576
kernel.pid_max=4194303
kernel.threads-max=2097152
net.core.netdev_max_backlog=65536
net.core.optmem_max=1048576
net.core.rmem_max=8388608
net.core.rmem_default=8388608
net.core.somaxconn=2048
net.core.wmem_max=8388608
net.core.wmem_default=8388608
vm.max_map_count=524288
vm.min_free_kbytes=262144
net.ipv4.tcp_tw_reuse=1
net.ipv4.tcp_max_syn_backlog=16384
net.ipv4.tcp_fin_timeout=10
net.ipv4.tcp_slow_start_after_idle=0
ceph -s output
--------------
root@ctrl1:/# ceph -s
  cluster:
    id:     06126476-6deb-4baa-b7ca-50f5ccfacb68
    health: HEALTH_ERR
            noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub flag(s) set
            704 osds down
            9 hosts (540 osds) down
            71 nearfull osd(s)
            2 pool(s) nearfull
            780664/74163111 objects misplaced (1.053%)
            7724/8242239 objects unfound (0.094%)
            396 PGs pending on creation
            Reduced data availability: 32597 pgs inactive, 29764 pgs down, 820 pgs peering, 74 pgs incomplete, 1 pg stale
            Degraded data redundancy: 679158/74163111 objects degraded (0.916%), 1250 pgs degraded, 1106 pgs undersized
            33 slow requests are blocked > 32 sec
            9 stuck requests are blocked > 4096 sec
            mons ctrl1,ctrl2,ctrl3 are using a lot of disk space

  services:
    mon: 3 daemons, quorum ctrl1,ctrl2,ctrl3
    mgr: ctrl1(active), standbys: ctrl2, ctrl3
    osd: 1080 osds: 376 up, 1080 in; 1963 remapped pgs
         flags noout,nobackfill,norebalance,norecover,noscrub,nodeep-scrub

  data:
    pools:   2 pools, 33280 pgs
    objects: 8049k objects, 2073 TB
    usage:   2277 TB used, 458 TB / 2736 TB avail
    pgs:     3.585% pgs unknown
             94.363% pgs not active
             679158/74163111 objects degraded (0.916%)
             780664/74163111 objects misplaced (1.053%)
             7724/8242239 objects unfound (0.094%)
             29754 down
             1193  unknown
             535   peering
             496   activating+undersized+degraded+remapped
             284   remapped+peering
             258   active+undersized+degraded+remapped
             161   activating+degraded+remapped
             143   active+recovering+undersized+degraded+remapped
             89    active+undersized+degraded
             76    active+clean+remapped
             71    incomplete
             48    active+undersized+remapped
             46    undersized+degraded+peered
             34    active+recovering+degraded+remapped
             26    active+clean
             21    activating+remapped
             13    activating+undersized+degraded
             10    down+remapped
             9     active+recovery_wait+undersized+degraded+remapped
             4     activating
             3     remapped+incomplete
             2     activating+undersized+remapped
             1     stale+peering
             1     undersized+peered
             1     undersized+remapped+peered
             1     activating+degraded
root@ctrl1:/# ceph version
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
osd.86 - busy
-------------------
2019-01-12 02:30:02.582363 7f00ea0f4700  0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
2019-01-12 02:30:02.582383 7f00ea0f4700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6868/770039 conn(0x55c4a8d89000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:04.005541 7f00ea8f5700  0 auth: could not find secret_id=7544
2019-01-12 02:30:04.005554 7f00ea8f5700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
2019-01-12 02:30:04.005557 7f00ea8f5700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.3:6836/405613 conn(0x55c4f3617800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:07.910816 7f00e98f3700  0 auth: could not find secret_id=7550
2019-01-12 02:30:07.910864 7f00e98f3700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7550
2019-01-12 02:30:07.910884 7f00e98f3700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.8:6824/767660 conn(0x55c4ec462800 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:30:16.982636 7f00ea8f5700  0 osd.86 213711 No AuthAuthorizeHandler found for protocol 0
2019-01-12 02:30:16.982640 7f00ea8f5700  0 -- 192.168.13.5:6806/351612 >> 192.168.13.6:6816/349322 conn(0x55c4aee6f000 :6806 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
osd.132 - idle
--------------
2019-01-12 02:31:38.370478 7fa508450700  0 auth: could not find secret_id=7551
2019-01-12 02:31:38.370487 7fa508450700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7551
2019-01-12 02:31:38.370489 7fa508450700  0 -- 192.168.13.5:6854/356774 >> 192.168.13.8:6872/1201589 conn(0x563b3e46f000 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-01-12 02:31:41.121603 7fa509452700  0 auth: could not find secret_id=7544
2019-01-12 02:31:41.121672 7fa509452700  0 cephx: verify_authorizer could not get service secret for service osd secret_id=7544
2019-01-12 02:31:41.121707 7fa509452700  0 -- 192.168.13.5:6854/356774 >> 192.168.13.9:6808/515991 conn(0x563b53997800 :6854 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer