Hi,
We have a Pacific cluster (16.2.4) with 30 servers and 30 OSDs. We
started increasing the pg_num for the data bucket pool more than a
month ago; I usually added 64 PGs per step and never had any issue. The
cluster was healthy before increasing the PGs.
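In case it matters, each step was a plain pg_num bump on the pool,
something along these lines (the pool name and target value below are
placeholders, not the exact commands):

    # example only - substitute the real data bucket pool name and value
    ceph osd pool get <data-bucket-pool> pg_num
    ceph osd pool set <data-bucket-pool> pg_num <current_pg_num + 64>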
Today I added 128 PGs and the cluster is stuck with some PGs unknown
and some others in the peering state. I've restarted a few OSDs with
slow_ops, and even a few hosts, but it didn't change anything. We don't
have any networking issues. Do you have any suggestions? Our service is
completely down...
  cluster:
    id:     322ef292-d129-11eb-96b2-a1b38fd61d55
    health: HEALTH_WARN
            Slow OSD heartbeats on back (longest 1517.814ms)
            Slow OSD heartbeats on front (longest 1551.680ms)
            Reduced data availability: 42 pgs inactive, 33 pgs peering
            1 pool(s) have non-power-of-two pg_num
            2888 slow ops, oldest one blocked for 6028 sec, daemons
            [osd.103,osd.115,osd.126,osd.129,osd.130,osd.138,osd.155,osd.174,osd.179,osd.181]...
            have slow ops.

  services:
    mon: 5 daemons, quorum osd-new-01,osd04,osd05,osd09,osd22 (age 11m)
    mgr: osd-new-01.babahi(active, since 11m), standbys: osd02.wqcizg
    osd: 311 osds: 311 up (since 3m), 311 in (since 3m); 29 remapped pgs
    rgw: 26 daemons active (26 hosts, 1 zones)

  data:
    pools:   8 pools, 2649 pgs
    objects: 590.57M objects, 1.5 PiB
    usage:   2.2 PiB used, 1.2 PiB / 3.4 PiB avail
    pgs:     0.340% pgs unknown
             1.246% pgs not active
             4056622/3539747751 objects misplaced (0.115%)
             2529 active+clean
             33   peering
             31   active+clean+laggy
             26   active+remapped+backfilling
             18   active+clean+scrubbing+deep
             9    unknown
             3    active+remapped+backfill_wait

  io:
    client:   38 KiB/s rd, 0 B/s wr, 37 op/s rd, 25 op/s wr
    recovery: 426 MiB/s, 158 objects/s
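If it helps, I can also share output from the usual stuck-PG
diagnostics, for example (the PG id is a placeholder for one of the
unknown/peering PGs):

    ceph health detail
    ceph pg dump_stuck inactive
    ceph pg <pgid> query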
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx