Hi,
We have a Pacific cluster (16.2.4) with 30 servers and 30 OSDs. We
started increasing the pg_num for the data bucket pool more than a
month ago; I usually added 64 PGs per step and never had any issue. The
cluster was healthy before increasing the PGs.
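In case it matters, each step was a plain pg_num bump on the pool,
something along these lines (the pool name and target value below are
placeholders, not the exact commands):

    # example only - substitute the real data bucket pool name and value
    ceph osd pool get <data-bucket-pool> pg_num
    ceph osd pool set <data-bucket-pool> pg_num <current_pg_num + 64>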
Today I added 128 PGs and the cluster is stuck with some PGs unknown
and some others in the peering state. I've restarted a few OSDs with
slow_ops, and even a few hosts, but it didn't change anything. We don't
have any networking issues. Do you have any suggestions? Our service is
completely down...
  cluster:
    id:     322ef292-d129-11eb-96b2-a1b38fd61d55
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 1517.814ms)
            Slow OSD heartbeats on front (longest 1551.680ms)
            Reduced data availability: 42 pgs inactive, 33 pgs peering
            1 pool(s) have non-power-of-two pg_num
            2888 slow ops, oldest one blocked for 6028 sec, daemons
            [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]...
            have slow ops.

  services:
    mon: 5 daemons, quorum osd-new-01,osd04,osd05,osd09,osd22 (age 11m)
    mgr: osd-new-01.babahi(active, since 11m), standbys: osd02.wqcizg
    osd: 311 osds: 311 up (since 3m), 311 in (since 3m); 29 remapped pgs
    rgw: 26 daemons active (26 hosts, 1 zones)

  data:
    pools:   8 pools, 2649 pgs
    objects: 590.57M objects, 1.5 PiB
    usage:   2.2 PiB used, 1.2 PiB / 3.4 PiB avail
    pgs:     0.340% pgs unknown
             1.246% pgs not active
             4056622/3539747751 objects misplaced (0.115%)
             2529 active+clean
             33   peering
             31   active+clean+laggy
             26   active+remapped+backfilling
             18   active+clean+scrubbing+deep
             9    unknown
             3    active+remapped+backfill_wait

  io:
    client:   38 KiB/s rd, 0 B/s wr, 37 op/s rd, 25 op/s wr
    recovery: 426 MiB/s, 158 objects/s
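If it helps, I can also share output from the usual stuck-PG
diagnostics, for example (the PG id is a placeholder for one of the
unknown/peering PGs):

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg <pgid> query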
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx