Rebuilding data resiliency stuck at 5% for a long time after adding new OSDs

sharathvuthpala@xxxxxxxxx · Tue, 12 Sep 2023 18:37:56 -0000

We have a user-provisioned (bare-metal installation) OpenShift cluster running version 4.12, and we use OpenShift Data Foundation as the storage system. Previously, 3 disks were attached to the storage system and 3 OSDs were available in the cluster. Today we added disks to the storage cluster, increasing the number from 3 to 9 (3 per node). The capacity expansion succeeded and resulted in 6 new OSDs in the cluster.

However, after this operation we noticed that Rebuilding Data Resiliency is stuck at 5% and not moving forward. At the same time, ceph status reported about 65% of objects as misplaced (58.456% in the output below), and many PGs are not in the active+clean state.

Here is more information about the Ceph cluster:

sh-4.4$ ceph status
  cluster:
    id:     18bf836d-4937-4925-b964-7a026c1d548d
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum b,u,v (age 2w)
    mgr: a(active, since 7w)
    mds: 1/1 daemons up, 1 hot standby
    osd: 9 osds: 9 up (since 5h), 9 in (since 5h); 191 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 305 pgs
    objects: 2.69M objects, 2.9 TiB
    usage:   8.8 TiB used, 27 TiB / 36 TiB avail
    pgs:     4723077/8079717 objects misplaced (58.456%)
             188 active+remapped+backfill_wait
             114 active+clean
             3   active+remapped+backfilling

  io:
    client:   679 KiB/s rd, 11 MiB/s wr, 13 op/s rd, 622 op/s wr
    recovery: 20 MiB/s, 89 keys/s, 22 objects/s

sh-4.4$ ceph balancer status
{
    "active": true,
    "last_optimize_duration": "0:00:00.000276",
    "last_optimize_started": "Tue Sep 12 17:36:03 2023",
    "mode": "upmap",
    "optimize_result": "Too many objects (0.581933 > 0.050000) are misplaced; try again later",
    "plans": []
}
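
To make the balancer message easier to read, here is a minimal Python sketch that pulls the two numbers out of the optimize_result string shown above. The helper name parse_misplaced is hypothetical, and the regular expression assumes the message keeps the "(ratio > threshold)" form seen in this output; it is not an official Ceph API.

```python
import json
import re

# Balancer output copied verbatim from `ceph balancer status` above.
balancer_status = json.loads("""
{
    "active": true,
    "last_optimize_duration": "0:00:00.000276",
    "last_optimize_started": "Tue Sep 12 17:36:03 2023",
    "mode": "upmap",
    "optimize_result": "Too many objects (0.581933 > 0.050000) are misplaced; try again later",
    "plans": []
}
""")


def parse_misplaced(result: str):
    """Return (misplaced_ratio, threshold) from the message, or None if absent.

    Hypothetical helper: assumes the "(x > y)" format seen in this output.
    """
    match = re.search(r"\(([\d.]+) > ([\d.]+)\)", result)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


ratio, threshold = parse_misplaced(balancer_status["optimize_result"])
print(f"misplaced ratio {ratio:.1%} vs balancer threshold {threshold:.1%}")
```

Read this way, the message says the balancer is skipping optimization for now because the misplaced ratio (about 58%) is far above the 5% limit it reports.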

We also observed that the number of misplaced objects is decreasing, and the misplaced percentage is dropping along with it. What might be the reason that Rebuilding Data Resiliency is not moving forward?
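
As a rough sanity check on the numbers in the ceph status output above, the following Python sketch recomputes the misplaced ratio and a very crude time estimate. It assumes the recovery rate stays near the 22 objects/s shown in that snapshot and that this rate counts the same units as the misplaced total; neither is guaranteed, so treat the result only as an order-of-magnitude guess. The function names are hypothetical.

```python
def misplaced_ratio(misplaced: int, total: int) -> float:
    """Fraction of object copies that are currently misplaced."""
    return misplaced / total


def eta_hours(misplaced: int, objects_per_second: float) -> float:
    """Crude time to drain the misplaced backlog at a constant rate, in hours."""
    if objects_per_second <= 0:
        raise ValueError("recovery rate must be positive")
    return misplaced / objects_per_second / 3600


# Values copied from the `ceph status` output above.
MISPLACED = 4_723_077
TOTAL = 8_079_717
RECOVERY_RATE = 22  # objects/s; assumed constant, which real recovery is not

print(f"misplaced: {misplaced_ratio(MISPLACED, TOTAL):.3%}")
print(f"rough ETA: {eta_hours(MISPLACED, RECOVERY_RATE):.1f} hours")
```

At that rate the backlog would take on the order of a couple of days to clear, which may be why the progress indicator appears not to move over a few hours.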

Any inputs would be appreciated.

Thanks
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx