Adding OSDs results in slow ops, inactive PGs

Hi

We have a cluster which currently looks like this:

    services:
      mon: 5 daemons, quorum lazy,jolly,happy,dopey,sleepy (age 13d)
      mgr: jolly.tpgixt(active, since 25h), standbys: dopey.lxajvk, lazy.xuhetq
      mds: 1/1 daemons up, 2 standby
      osd: 449 osds: 425 up (since 15m), 425 in (since 5m); 5104 remapped pgs
    data:
      volumes: 1/1 healthy
      pools:   13 pools, 11153 pgs
      objects: 304.11M objects, 988 TiB
      usage:   1.6 PiB used, 1.4 PiB / 2.9 PiB avail
      pgs:     6/1617270006 objects degraded (0.000%)
               366696947/1617270006 objects misplaced (22.674%)
               6043 active+clean
               5041 active+remapped+backfill_wait
               66   active+remapped+backfilling
               2    active+recovery_wait+degraded+remapped
               1    active+recovering+degraded


The cluster is currently rebalancing after adding a node, but the rebalance has been rather slow -- right now it's running 66 backfills, but it eventually seems to stabilize at around 8. We figured that adding another node might speed things up.
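For completeness, this is roughly how we have been checking the backfill throttling. It is only a sketch, assuming the mClock scheduler is active, and osd.163 is just an arbitrary example OSD:

    # Current backfill limit and active mClock profile
    ceph config get osd osd_max_backfills
    ceph config get osd osd_mclock_profile

    # What an individual OSD is actually running with
    ceph config show osd.163 osd_max_backfills

    # A recovery-biased mClock profile can be selected cluster-wide
    ceph config set osd osd_mclock_profile high_recovery_ops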

Immediately upon adding the node, we get slow ops and inactive PGs. Removing the new node gets us back in working order.
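One thing we are considering, but have not tried yet, is pausing data movement while the OSDs are added, along these lines (just a sketch):

    # Pause rebalance/backfill before adding the OSD
    ceph osd set norebalance
    ceph osd set nobackfill

    # ... add the new OSD / node here ...

    # Resume data movement once the new OSD is up and in
    ceph osd unset nobackfill
    ceph osd unset norebalance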

It turns out that adding even a single OSD breaks the cluster, and immediately puts it in this state:

    [WRN] PG_DEGRADED: Degraded data redundancy: 6/1617265712 objects degraded (0.000%), 3 pgs degraded
        pg 37.c8 is active+recovery_wait+degraded+remapped, acting [410,163,236,209,7,283,155,143,78]
        pg 37.1a1 is active+recovering+degraded, acting [234,424,163,74,22,128,177,153,181]
        pg 37.1da is active+recovery_wait+degraded+remapped, acting [163,408,230,190,93,284,50,78,44]
    [WRN] SLOW_OPS: 22 slow ops, oldest one blocked for 54 sec, daemons [osd.11,osd.110,osd.112,osd.117,osd.120,osd.123,osd.13,osd.136,osd.144,osd.157]... have slow ops.

The OSD we added was osd.431, so it does not appear to be the immediate cause of the slow ops; however, removing osd.431 immediately clears the problem.
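Next time we reproduce this we can capture the blocked ops on one of the listed OSDs, roughly like this (osd.11 is just the first daemon from the warning above, run on the host where it lives):

    # Ops currently blocked / in flight on the OSD
    ceph daemon osd.11 dump_blocked_ops
    ceph daemon osd.11 dump_ops_in_flight

    # Recently recorded slow ops, where supported
    ceph daemon osd.11 dump_historic_slow_ops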

We thought we might be seeing the 'CRUSH gives up too soon' symptoms [1], as we have seen similar behaviour on another pool, but that does not appear to be the case here: we went through the steps described on that page and everything looked OK.
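Concretely, what we ran was roughly the procedure from that page (rule id and --num-rep below are placeholders; --num-rep should match the pool size, e.g. 6 for a 4+2 profile):

    # Extract the current crushmap
    ceph osd getcrushmap -o crush.map

    # Look for mappings that come back incomplete
    crushtool -i crush.map --test --show-bad-mappings \
        --rule 1 --num-rep 6 --min-x 1 --max-x 1048576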

At least one of the pools that stops working is a 4+2 EC pool placed on spinning rust, some 200-ish disks distributed across 13 nodes. I'm not sure whether other pools break as well, but that particular EC pool is rather important, so I'm a little wary of experimenting blindly.
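In case it helps, the pool layout can be dumped like this ('ecpool' and the profile/rule names are placeholders for the real ones):

    # Which EC profile and crush rule the pool uses
    ceph osd pool get ecpool erasure_code_profile
    ceph osd pool get ecpool crush_rule

    # k/m and failure domain of the profile
    ceph osd erasure-code-profile get <profile-name>

    # Full definition of the rule
    ceph osd crush rule dump <rule-name>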

Any thoughts on where to look next?

Thanks,
Ruben Vestergaard

[1] https://docs.ceph.com/en/reef/rados/troubleshooting/troubleshooting-pg/#crush-gives-up-too-soon


