Hi,

As long as you see changes and "recovery" activity, it is making progress, so I guess you just have to wait... What kind of disks did you add?

HTH
Mehmet

On 12 September 2023 20:37:56 CEST, sharathvuthpala@xxxxxxxxx wrote:
>We have a user-provisioned instance (bare metal installation) of an OpenShift cluster running version 4.12, and we are using OpenShift Data Foundation as the storage system. Earlier we had 3 disks attached to the storage system and 3 OSDs available in the cluster. Today, while adding additional disks to the storage cluster, we increased the number of disks from 3 to 9, i.e. 3 per node. The addition of storage capacity was successful, resulting in 6 new OSDs in the cluster.
>
>But after this operation we noticed that Rebuilding Data Resiliency is stuck at 5% and not moving forward. At the same time, ceph status shows 65% of objects misplaced and PGs that are not in the active+clean state.
>
>Here is more information about the ceph cluster:
>
>sh-4.4$ ceph status
>  cluster:
>    id:     18bf836d-4937-4925-b964-7a026c1d548d
>    health: HEALTH_OK
>
>  services:
>    mon: 3 daemons, quorum b,u,v (age 2w)
>    mgr: a(active, since 7w)
>    mds: 1/1 daemons up, 1 hot standby
>    osd: 9 osds: 9 up (since 5h), 9 in (since 5h); 191 remapped pgs
>    rgw: 1 daemon active (1 hosts, 1 zones)
>
>  data:
>    volumes: 1/1 healthy
>    pools:   12 pools, 305 pgs
>    objects: 2.69M objects, 2.9 TiB
>    usage:   8.8 TiB used, 27 TiB / 36 TiB avail
>    pgs:     4723077/8079717 objects misplaced (58.456%)
>             188 active+remapped+backfill_wait
>             114 active+clean
>             3   active+remapped+backfilling
>
>  io:
>    client:   679 KiB/s rd, 11 MiB/s wr, 13 op/s rd, 622 op/s wr
>    recovery: 20 MiB/s, 89 keys/s, 22 objects/s
>
>sh-4.4$ ceph balancer status
>{
>    "active": true,
>    "last_optimize_duration": "0:00:00.000276",
>    "last_optimize_started": "Tue Sep 12 17:36:03 2023",
>    "mode": "upmap",
>    "optimize_result": "Too many objects (0.581933 > 0.050000) are misplaced; try again later",
>    "plans": []
>}
>
>One more thing we observed is that the number of misplaced objects is decreasing, and the misplaced percentage is dropping as well. What might be the reason Rebuilding Data Resiliency is not moving forward?
>
>Any inputs would be appreciated.
>
>Thanks
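
For reference, a minimal sketch of how the remaining backfill can be watched from the ODF/rook-ceph toolbox pod (assuming admin keyring access; the commands are standard Ceph CLI, and the 30-second interval is only an example):

sh-4.4$ ceph status                                      # overall health and current recovery throughput
sh-4.4$ ceph pg stat                                     # how many PGs are still remapped/backfilling vs. active+clean
sh-4.4$ ceph osd df tree                                 # confirms data is actually spreading onto the new OSDs
sh-4.4$ ceph config get mgr target_max_misplaced_ratio   # the 0.050000 threshold the balancer compares against
sh-4.4$ watch -n 30 'ceph status'                        # repeat until all 305 PGs are active+clean (if watch is available in the pod)

Once the misplaced ratio drops below that threshold (5% by default), the balancer should resume optimizing, and the ODF "Data Resiliency" indicator should catch up as PGs return to active+clean.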