Re: ceph is stuck after increasing pg_nums

The problem was a single OSD daemon (not reported in health detail) which slowed down the entire peering process. After restarting it, the cluster got back to normal.
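In case it helps anyone else hitting a peering hang like this, the rough sequence for tracking down and bouncing a misbehaving OSD looks something like the following (osd.240 is only an example id, and the restart lines assume a cephadm-managed cluster or a package install respectively):

    # which PGs are stuck and which OSDs are blocking peering
    ceph pg ls peering
    ceph osd blocked-by

    # look for per-OSD latency outliers
    ceph osd perf

    # on the host running a suspect OSD, inspect its in-flight and recent ops
    ceph daemon osd.240 dump_ops_in_flight
    ceph daemon osd.240 dump_historic_ops

    # restart the suspect daemon
    ceph orch daemon restart osd.240        # cephadm-managed cluster
    # systemctl restart ceph-osd@240        # package-based install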


On 11/4/2022 10:49 AM, Adrian Nicolae wrote:
 ceph health detail
HEALTH_WARN Reduced data availability: 42 pgs inactive, 33 pgs peering; 1 pool(s) have non-power-of-two pg_num; 2371 slow ops, oldest one blocked for 6218 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 42 pgs inactive, 33 pgs peering
    pg 6.eb is stuck peering for 54m, current state peering, last acting [79,279,68,179,264,240]
    pg 6.10f is stuck peering for 36m, current state peering, last acting [288,161,37,63,178,240]
    pg 6.115 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.139 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.17e is stuck peering for 103m, current state peering, last acting [126,190,252,282,113,240]
    pg 6.1a5 is stuck peering for 103m, current state peering, last acting [41,158,240,177,66,228]
    pg 6.1ae is stuck peering for 103m, current state peering, last acting [186,240,162,221,289,219]
    pg 6.1eb is stuck peering for 36m, current state peering, last acting [220,240,184,226,205,254]
    pg 6.21b is stuck peering for 58m, current state peering, last acting [179,301,168,292,240,121]
    pg 6.26d is stuck peering for 36m, current state peering, last acting [68,305,240,47,137,184]
    pg 6.348 is stuck peering for 77m, current state peering, last acting [138,307,221,125,240,285]
    pg 6.369 is stuck peering for 54m, current state peering, last acting [35,66,240,254,58,179]
    pg 6.39f is stuck peering for 28m, current state peering, last acting [264,46,240,154,101,194]
    pg 6.3ca is stuck peering for 58m, current state peering, last acting [202,213,174,296,240,45]
    pg 6.3cb is stuck inactive for 14m, current state unknown, last acting []
    pg 6.3e1 is stuck peering for 77m, current state peering, last acting [115,168,240,85,56,26]
    pg 6.3f3 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.473 is stuck peering for 36m, current state peering, last acting [265,53,77,240,182,92]
    pg 6.576 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.5a6 is stuck peering for 103m, current state peering, last acting [257,37,240,54,263,68]
    pg 6.5eb is stuck inactive for 14m, current state unknown, last acting []
    pg 6.63f is stuck peering for 85m, current state peering, last acting [252,53,240,131,25,278]
    pg 6.655 is stuck peering for 103m, current state peering, last acting [103,267,222,308,240,277]
    pg 6.6d5 is stuck peering for 36m, current state peering, last acting [197,171,276,177,210,240]
    pg 6.6f2 is stuck peering for 85m, current state peering, last acting [174,122,81,129,304,240]
    pg 6.721 is stuck peering for 51m, current state peering, last acting [181,76,294,249,299,240]
    pg 6.757 is stuck peering for 23m, current state peering, last acting [288,194,213,240,37,22]
    pg 6.785 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.793 is stuck peering for 77m, current state peering, last acting [155,301,240,294,214,265]
    pg 6.798 is stuck peering for 51m, current state peering, last acting [186,278,196,211,260,240]
    pg 6.79b is stuck peering for 54m, current state peering, last acting [186,25,108,240,300,39]
    pg 6.7b7 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.7c5 is stuck peering for 103m, current state peering, last acting [130,179,266,240,162,294]
    pg 6.7df is stuck peering for 36m, current state peering, last acting [188,240,182,282,265,199]
    pg 6.83c is stuck peering for 77m, current state peering, last acting [155,81,228,65,207,240]
    pg 6.85f is stuck peering for 103m, current state peering, last acting [129,263,307,28,240,63]
    pg 6.917 is stuck peering for 54m, current state peering, last acting [84,179,240,295,92,269]
    pg 6.939 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.97b is stuck peering for 103m, current state peering, last acting [34,96,293,129,147,240]
    pg 6.97e is stuck peering for 103m, current state peering, last acting [126,190,252,282,113,240]
    pg 6.9a5 is stuck peering for 103m, current state peering, last acting [41,158,240,186,66,228]
    pg 6.9ae is stuck peering for 103m, current state peering, last acting [186,240,162,221,289,219]
[WRN] POOL_PG_NUM_NOT_POWER_OF_TWO: 1 pool(s) have non-power-of-two pg_num
    pool 'us-east-1.rgw.buckets.data' pg_num 2480 is not a power of two
[WRN] SLOW_OPS: 2371 slow ops, oldest one blocked for 6218 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
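
(As a side note, once peering is healthy again the non-power-of-two warning itself is straightforward to clear. A rough sketch below; 4096 is only an example target, being the next power of two above 2480, and since Nautilus the cluster splits the PGs gradually on its own once the target is set.)

    # current split state of the pool
    ceph osd pool get us-east-1.rgw.buckets.data pg_num
    ceph osd pool get us-east-1.rgw.buckets.data pgp_num

    # set the target to a power of two; 4096 is just an example value
    ceph osd pool set us-east-1.rgw.buckets.data pg_num 4096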

On 11/4/2022 10:45 AM, Adrian Nicolae wrote:
Hi,

We have a Pacific cluster (16.2.4) with 30 servers and 311 osds. We started increasing the pg_num for the data bucket pool more than a month ago; I usually added 64 pgs in every step and didn't have any issues. The cluster was healthy before increasing the pgs.
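
The stepping itself was nothing fancy, roughly like this each time (the +64 increment is just what we used here, and the pool is the one from the health output):

    # one step of the increase: read the current pg_num and add 64 to it
    POOL=us-east-1.rgw.buckets.data
    CUR=$(ceph osd pool get "$POOL" pg_num | awk '{print $2}')
    ceph osd pool set "$POOL" pg_num $((CUR + 64))

    # then wait for the cluster to settle back to HEALTH_OK before the next step
    ceph -s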

Today I added 128 pgs and the cluster is stuck with some unknown pgs and some others in the peering state. I've restarted a few osds with slow ops and even a few hosts, but it didn't change anything. We don't have any networking issues. Do you have any suggestions? Our service is completely down...
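
For reference, what I have been looking at so far is roughly the following (the pg and osd ids are taken from the health detail output, just as examples):

    # which PGs are stuck inactive or unclean
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # query one of the peering PGs for its peering/recovery state
    # (6.eb is just one of the PGs from the health detail output)
    ceph pg 6.eb query

    # in-flight ops on one of the OSDs reported with slow ops
    # (run on the host where osd.103 lives)
    ceph daemon osd.103 dump_ops_in_flight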

  cluster:
    id:     322ef292-d129-11eb-96b2-a1b38fd61d55
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 1517.814ms)
            Slow OSD heartbeats on front (longest 1551.680ms)
            Reduced data availability: 42 pgs inactive, 33 pgs peering
            1 pool(s) have non-power-of-two pg_num
            2888 slow ops, oldest one blocked for 6028 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.

  services:
    mon: 5 daemons, quorum osd-new-01,osd04,osd05,osd09,osd22 (age 11m)
    mgr: osd-new-01.babahi(active, since 11m), standbys: osd02.wqcizg
    osd: 311 osds: 311 up (since 3m), 311 in (since 3m); 29 remapped pgs
    rgw: 26 daemons active (26 hosts, 1 zones)

  data:
    pools:   8 pools, 2649 pgs
    objects: 590.57M objects, 1.5 PiB
    usage:   2.2 PiB used, 1.2 PiB / 3.4 PiB avail
    pgs:     0.340% pgs unknown
             1.246% pgs not active
             4056622/3539747751 objects misplaced (0.115%)
             2529 active+clean
             33   peering
             31   active+clean+laggy
             26   active+remapped+backfilling
             18   active+clean+scrubbing+deep
             9    unknown
             3    active+remapped+backfill_wait

  io:
    client:   38 KiB/s rd, 0 B/s wr, 37 op/s rd, 25 op/s wr
    recovery: 426 MiB/s, 158 objects/s


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



