Re: ceph is stuck after increasing pg_nums

The problem was a single OSD daemon (not reported in health detail) which slowed down the entire peering process. After restarting it, the cluster got back to normal.
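In case it helps anyone else hitting a peering hang like this, the rough sequence for tracking down and bouncing a misbehaving OSD looks something like the following (osd.240 is only an example id, and the restart lines assume a cephadm-managed cluster or a package install respectively):

    # which PGs are stuck and which OSDs are blocking peering
    ceph pg ls peering
    ceph osd blocked-by

    # look for per-OSD latency outliers
    ceph osd perf

    # on the host running a suspect OSD, inspect its in-flight and recent ops
    ceph daemon osd.240 dump_ops_in_flight
    ceph daemon osd.240 dump_historic_ops

    # restart the suspect daemon
    ceph orch daemon restart osd.240        # cephadm-managed cluster
    # systemctl restart ceph-osd@240        # package-based install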


On 11/4/2022 10:49 AM, Adrian Nicolae wrote:
 ceph health detail
HEALTH_WARN Reduced data availability: 42 pgs inactive, 33 pgs peering; 1 pool(s) have non-power-of-two pg_num; 2371 slow ops, oldest one blocked for 6218 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
[WRN] PG_AVAILABILITY: Reduced data availability: 42 pgs inactive, 33 pgs peering
    pg 6.eb is stuck peering for 54m, current state peering, last acting [79,279,68,179,264,240]
    pg 6.10f is stuck peering for 36m, current state peering, last acting [288,161,37,63,178,240]
    pg 6.115 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.139 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.17e is stuck peering for 103m, current state peering, last acting [126,190,252,282,113,240]
    pg 6.1a5 is stuck peering for 103m, current state peering, last acting [41,158,240,177,66,228]
    pg 6.1ae is stuck peering for 103m, current state peering, last acting [186,240,162,221,289,219]
    pg 6.1eb is stuck peering for 36m, current state peering, last acting [220,240,184,226,205,254]
    pg 6.21b is stuck peering for 58m, current state peering, last acting [179,301,168,292,240,121]
    pg 6.26d is stuck peering for 36m, current state peering, last acting [68,305,240,47,137,184]
    pg 6.348 is stuck peering for 77m, current state peering, last acting [138,307,221,125,240,285]
    pg 6.369 is stuck peering for 54m, current state peering, last acting [35,66,240,254,58,179]
    pg 6.39f is stuck peering for 28m, current state peering, last acting [264,46,240,154,101,194]
    pg 6.3ca is stuck peering for 58m, current state peering, last acting [202,213,174,296,240,45]
    pg 6.3cb is stuck inactive for 14m, current state unknown, last acting []
    pg 6.3e1 is stuck peering for 77m, current state peering, last acting [115,168,240,85,56,26]
    pg 6.3f3 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.473 is stuck peering for 36m, current state peering, last acting [265,53,77,240,182,92]
    pg 6.576 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.5a6 is stuck peering for 103m, current state peering, last acting [257,37,240,54,263,68]
    pg 6.5eb is stuck inactive for 14m, current state unknown, last acting []
    pg 6.63f is stuck peering for 85m, current state peering, last acting [252,53,240,131,25,278]
    pg 6.655 is stuck peering for 103m, current state peering, last acting [103,267,222,308,240,277]
    pg 6.6d5 is stuck peering for 36m, current state peering, last acting [197,171,276,177,210,240]
    pg 6.6f2 is stuck peering for 85m, current state peering, last acting [174,122,81,129,304,240]
    pg 6.721 is stuck peering for 51m, current state peering, last acting [181,76,294,249,299,240]
    pg 6.757 is stuck peering for 23m, current state peering, last acting [288,194,213,240,37,22]
    pg 6.785 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.793 is stuck peering for 77m, current state peering, last acting [155,301,240,294,214,265]
    pg 6.798 is stuck peering for 51m, current state peering, last acting [186,278,196,211,260,240]
    pg 6.79b is stuck peering for 54m, current state peering, last acting [186,25,108,240,300,39]
    pg 6.7b7 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.7c5 is stuck peering for 103m, current state peering, last acting [130,179,266,240,162,294]
    pg 6.7df is stuck peering for 36m, current state peering, last acting [188,240,182,282,265,199]
    pg 6.83c is stuck peering for 77m, current state peering, last acting [155,81,228,65,207,240]
    pg 6.85f is stuck peering for 103m, current state peering, last acting [129,263,307,28,240,63]
    pg 6.917 is stuck peering for 54m, current state peering, last acting [84,179,240,295,92,269]
    pg 6.939 is stuck inactive for 14m, current state unknown, last acting []
    pg 6.97b is stuck peering for 103m, current state peering, last acting [34,96,293,129,147,240]
    pg 6.97e is stuck peering for 103m, current state peering, last acting [126,190,252,282,113,240]
    pg 6.9a5 is stuck peering for 103m, current state peering, last acting [41,158,240,186,66,228]
    pg 6.9ae is stuck peering for 103m, current state peering, last acting [186,240,162,221,289,219]
[WRN] POOL_PG_NUM_NOT_POWER_OF_TWO: 1 pool(s) have non-power-of-two pg_num
    pool 'us-east-1.rgw.buckets.data' pg_num 2480 is not a power of two
[WRN] SLOW_OPS: 2371 slow ops, oldest one blocked for 6218 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.
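
(As a side note, once peering is healthy again the non-power-of-two warning itself is straightforward to clear. A rough sketch below; 4096 is only an example target, being the next power of two above 2480, and since Nautilus the cluster splits the PGs gradually on its own once the target is set.)

    # current split state of the pool
    ceph osd pool get us-east-1.rgw.buckets.data pg_num
    ceph osd pool get us-east-1.rgw.buckets.data pgp_num

    # set the target to a power of two; 4096 is just an example value
    ceph osd pool set us-east-1.rgw.buckets.data pg_num 4096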

On 11/4/2022 10:45 AM, Adrian Nicolae wrote:
Hi,

We have a Pacific cluster (16.2.4) with 30 servers and 311 osds. We started increasing the pg_num for the data bucket pool more than a month ago; I usually added 64 pgs in every step and didn't have any issues. The cluster was healthy before increasing the pgs.
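
The stepping itself was nothing fancy, roughly like this each time (the +64 increment is just what we used here, and the pool is the one from the health output):

    # one step of the increase: read the current pg_num and add 64 to it
    POOL=us-east-1.rgw.buckets.data
    CUR=$(ceph osd pool get "$POOL" pg_num | awk '{print $2}')
    ceph osd pool set "$POOL" pg_num $((CUR + 64))

    # then wait for the cluster to settle back to HEALTH_OK before the next step
    ceph -s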

Today I added 128 pgs and the cluster is stuck with some unknown pgs and some others in the peering state. I've restarted a few osds with slow ops and even a few hosts, but it didn't change anything. We don't have any networking issues. Do you have any suggestions? Our service is completely down...
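
For reference, what I have been looking at so far is roughly the following (the pg and osd ids are taken from the health detail output, just as examples):

    # which PGs are stuck inactive or unclean
    ceph pg dump_stuck inactive
    ceph pg dump_stuck unclean

    # query one of the peering PGs for its peering/recovery state
    # (6.eb is just one of the PGs from the health detail output)
    ceph pg 6.eb query

    # in-flight ops on one of the OSDs reported with slow ops
    # (run on the host where osd.103 lives)
    ceph daemon osd.103 dump_ops_in_flight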

  cluster:
    id:     322ef292-d129-11eb-96b2-a1b38fd61d55
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 1517.814ms)
            Slow OSD heartbeats on front (longest 1551.680ms)
            Reduced data availability: 42 pgs inactive, 33 pgs peering
            1 pool(s) have non-power-of-two pg_num
            2888 slow ops, oldest one blocked for 6028 sec, daemons [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]... have slow ops.

  services:
    mon: 5 daemons, quorum osd-new-01,osd04,osd05,osd09,osd22 (age 11m)
    mgr: osd-new-01.babahi(active, since 11m), standbys: osd02.wqcizg
    osd: 311 osds: 311 up (since 3m), 311 in (since 3m); 29 remapped pgs
    rgw: 26 daemons active (26 hosts, 1 zones)

  data:
    pools:   8 pools, 2649 pgs
    objects: 590.57M objects, 1.5 PiB
    usage:   2.2 PiB used, 1.2 PiB / 3.4 PiB avail
    pgs:     0.340% pgs unknown
             1.246% pgs not active
             4056622/3539747751 objects misplaced (0.115%)
             2529 active+clean
             33   peering
             31   active+clean+laggy
             26   active+remapped+backfilling
             18   active+clean+scrubbing+deep
             9    unknown
             3    active+remapped+backfill_wait

  io:
    client:   38 KiB/s rd, 0 B/s wr, 37 op/s rd, 25 op/s wr
    recovery: 426 MiB/s, 158 objects/s


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



