The problem was a single OSD daemon (not reported in the health detail output) which was slowing down the entire peering process. After restarting it, the cluster got back to normal.
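
For anyone hitting the same symptoms: the blocking daemon can usually be spotted without restarting OSDs at random, e.g. by querying one of the stuck PGs and checking what peering is waiting on. A rough sketch (the PG id is just the first one from the output quoted below, and the cephadm-style restart is an assumption about the deployment):

# see which OSD the stuck PG is waiting on (recovery_state / peering_blocked_by)
ceph pg 6.eb query
# per-OSD commit/apply latency; a single outlier often points at the culprit
ceph osd perf
# restart only the offending daemon (on a non-cephadm host:
# systemctl restart ceph-osd@<id>)
ceph orch daemon restart osd.<id>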
On 11/4/2022 10:49 AM, Adrian Nicolae wrote:
ceph health detail
HEALTH_WARN Reduced data availability: 42 pgs inactive, 33 pgs peering; 1 pool(s) have non-power-of-two pg_num; 2371 slow ops, oldest one blocked for 6218 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 42 pgs inactive, 33 pgs peering
    pg 6.eb is stuck peering for 54m, current state peering, last acting [79,279,68,179,264,240]
    pg 6.10f is stuck peering for 36m, current state peering, last acting [288,161,37,63,178,240]
    pg 6.115 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.139 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.17e is stuck peering for 103m, current state peering, last acting [126,190,252,282,113,240]
    pg 6.1a5 is stuck peering for 103m, current state peering, last acting [41,158,240,177,66,228]
    pg 6.1ae is stuck peering for 103m, current state peering, last acting [186,240,162,221,289,219]
    pg 6.1eb is stuck peering for 36m, current state peering, last acting [220,240,184,226,205,254]
    pg 6.21b is stuck peering for 58m, current state peering, last acting [179,301,168,292,240,121]
    pg 6.26d is stuck peering for 36m, current state peering, last acting [68,305,240,47,137,184]
    pg 6.348 is stuck peering for 77m, current state peering, last acting [138,307,221,125,240,285]
    pg 6.369 is stuck peering for 54m, current state peering, last acting [35,66,240,254,58,179]
    pg 6.39f is stuck peering for 28m, current state peering, last acting [264,46,240,154,101,194]
    pg 6.3ca is stuck peering for 58m, current state peering, last acting [202,213,174,296,240,45]
    pg 6.3cb is stuck inactive for 14m, current state unknown, last acting []
    pg 6.3e1 is stuck peering for 77m, current state peering, last acting [115,168,240,85,56,26]
    pg 6.3f3 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.473 is stuck peering for 36m, current state peering, last acting [265,53,77,240,182,92]
    pg 6.576 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.5a6 is stuck peering for 103m, current state peering, last acting [257,37,240,54,263,68]
    pg 6.5eb is stuck inactive for 14m, current state unknown, last acting []
    pg 6.63f is stuck peering for 85m, current state peering, last acting [252,53,240,131,25,278]
    pg 6.655 is stuck peering for 103m, current state peering, last acting [103,267,222,308,240,277]
    pg 6.6d5 is stuck peering for 36m, current state peering, last acting [197,171,276,177,210,240]
    pg 6.6f2 is stuck peering for 85m, current state peering, last acting [174,122,81,129,304,240]
    pg 6.721 is stuck peering for 51m, current state peering, last acting [181,76,294,249,299,240]
    pg 6.757 is stuck peering for 23m, current state peering, last acting [288,194,213,240,37,22]
    pg 6.785 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.793 is stuck peering for 77m, current state peering, last acting [155,301,240,294,214,265]
    pg 6.798 is stuck peering for 51m, current state peering, last acting [186,278,196,211,260,240]
    pg 6.79b is stuck peering for 54m, current state peering, last acting [186,25,108,240,300,39]
    pg 6.7b7 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.7c5 is stuck peering for 103m, current state peering, last acting [130,179,266,240,162,294]
    pg 6.7df is stuck peering for 36m, current state peering, last acting [188,240,182,282,265,199]
    pg 6.83c is stuck peering for 77m, current state peering, last acting [155,81,228,65,207,240]
    pg 6.85f is stuck peering for 103m, current state peering, last acting [129,263,307,28,240,63]
    pg 6.917 is stuck peering for 54m, current state peering, last acting [84,179,240,295,92,269]
    pg 6.939 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.97b is stuck peering for 103m, current state peering, last acting [34,96,293,129,147,240]
    pg 6.97e is stuck peering for 103m, current state peering, last acting [126,190,252,282,113,240]
    pg 6.9a5 is stuck peering for 103m, current state peering, last acting [41,158,240,186,66,228]
    pg 6.9ae is stuck peering for 103m, current state peering, last acting [186,240,162,221,289,219]
[WRN] POOL_PG_NUM_NOT_POWER_OF_TWO: 1 pool(s) have non-power-of-two pg_num
    pool 'us-east-1.rgw.buckets.data' pg_num 2480 is not a power of two
[WRN] SLOW_OPS: 2371 slow ops, oldest one blocked for 6218 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
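
(On the POOL_PG_NUM_NOT_POWER_OF_TWO warning above: it clears once the pool's pg_num lands on a power of two; 4096 below is only an example target, it could just as well be 2048.)

# current value for the pool
ceph osd pool get us-east-1.rgw.buckets.data pg_num
# set the final power-of-two target; since Nautilus the mons ramp the
# actual pg_num up gradually in the background
ceph osd pool set us-east-1.rgw.buckets.data pg_num 4096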
On 11/4/2022 10:45 AM, Adrian Nicolae wrote:
Hi,
We have a Pacific cluster (16.2.4) with 30 servers and 311 OSDs. We started increasing the pg_num of the bucket data pool more than a month ago, usually adding 64 PGs per step, and never had any issue. The cluster was healthy before each increase.
Today I added 128 PGs and the cluster is now stuck with some PGs unknown and some others in the peering state. I've restarted a few OSDs with slow ops and even a few hosts, but it didn't change anything. We don't have any networking issues. Do you have any suggestions? Our service is completely down ...
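
(Each step amounts to something like the following; the pool name is taken from the health output above and <next_target> stands for whatever the next increment works out to.)

# check where the pool currently stands
ceph osd pool get us-east-1.rgw.buckets.data pg_num
# raise the target in small increments and let peering/backfill settle
ceph osd pool set us-east-1.rgw.buckets.data pg_num <next_target>
# watch progress between steps
ceph -s
ceph health detail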
cluster:
id: 322ef292-d129-11eb-96b2-a1b38fd61d55
health: HEALTH_WARN
Slow OSD heartbeats on back (longest 1517.814ms)
Slow OSD heartbeats on front (longest 1551.680ms)
Reduced data availability: 42 pgs inactive, 33 pgs peering
1 pool(s) have non-power-of-two pg_num
2888 slow ops, oldest one blocked for 6028 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
services:
mon: 5 daemons, quorum osd-new-01,osd04,osd05,osd09,osd22 (age 11m)
mgr: osd-new-01.babahi(active, since 11m), standbys: osd02.wqcizg
osd: 311 osds: 311 up (since 3m), 311 in (since 3m); 29 remapped pgs
rgw: 26 daemons active (26 hosts, 1 zones)
data:
pools: 8 pools, 2649 pgs
objects: 590.57M objects, 1.5 PiB
usage: 2.2 PiB used, 1.2 PiB / 3.4 PiB avail
pgs: 0.340% pgs unknown
1.246% pgs not active
4056622/3539747751 objects misplaced (0.115%)
2529 active+clean
33 peering
31 active+clean+laggy
26 active+remapped+backfilling
18 active+clean+scrubbing+deep
9 unknown
3 active+remapped+backfill_wait
io:
client: 38 KiB/s rd, 0 B/s wr, 37 op/s rd, 25 op/s wr
recovery: 426 MiB/s, 158 objects/s
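
(The slow ops and slow heartbeats reported above can be inspected per daemon via the admin socket on the OSD's host; osd.103 is simply the first entry in the warning list, not necessarily the culprit, and dump_osd_network is my recollection of the command behind the heartbeat warning.)

# ops currently blocked in flight on that OSD
ceph daemon osd.103 dump_ops_in_flight
# recently completed ops with their per-stage timings
ceph daemon osd.103 dump_historic_ops
# heartbeat ping times behind the 'Slow OSD heartbeats' warning
ceph daemon osd.103 dump_osd_network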