Hi,

Stuck activating could be an old known issue: if the cluster has many
(>100) PGs per OSD, the OSDs may temporarily need to hold more than the
max (300), and PGs therefore get stuck activating. We always use this
option as a workaround:

    osd max pg per osd hard ratio = 10.0

I suggest giving this a try -- it can't hurt much. (A short sketch of how
to apply it is at the bottom of this mail.)

Cheers, Dan

On Wed, Jun 23, 2021 at 4:29 PM Justin Goetz <jgoetz@xxxxxxxxxxxxxx> wrote:
>
> Hello!
>
> We are in the process of expanding our Ceph cluster (both adding OSD
> hosts and replacing smaller HDDs on our existing hosts). So far we have
> gone host by host, removing the old OSDs, swapping the physical HDDs,
> and re-adding them. This process has gone smoothly, aside from one
> issue: upon any action taken on the cluster (adding new OSDs, replacing
> old ones, etc.), PGs get stuck "activating", which causes around 3.5%
> of PGs to go inactive and IO to stop.
>
> Here is a current look at our ceph -s output:
>
>   cluster:
>     id:     e8ffe2eb-f8fc-4110-a4bc-1715e878fb7b
>     health: HEALTH_WARN
>             Reduced data availability: 166 pgs inactive
>             Degraded data redundancy: 137153907/3658405707 objects
>             degraded (3.749%), 930 pgs degraded, 928 pgs undersized
>             10 pgs not deep-scrubbed in time
>             33709 slow ops, oldest one blocked for 35956 sec, daemons
>             [osd.103,osd.104,osd.105,osd.106,osd.107,osd.109,osd.111,osd.112,osd.113,osd.114]...
>             have slow ops.
>
>   services:
>     mon: 3 daemons, quorum lb3,lb2,lb1 (age 8w)
>     mgr: lb1(active, since 6w), standbys: lb3, lb2
>     osd: 117 osds: 117 up (since 15m), 117 in (since 10h); 2033 remapped pgs
>     rgw: 3 daemons active (lb1.rgw0, lb2.rgw0, lb3.rgw0)
>
>   task status:
>
>   data:
>     pools:   8 pools, 5793 pgs
>     objects: 609.74M objects, 169 TiB
>     usage:   308 TiB used, 430 TiB / 738 TiB avail
>     pgs:     2.866% pgs not active
>              137153907/3658405707 objects degraded (3.749%)
>              262215404/3658405707 objects misplaced (7.167%)
>              3754 active+clean
>              963  active+remapped+backfill_wait
>              892  active+undersized+degraded+remapped+backfill_wait
>              136  activating+remapped
>              27   activating+undersized+degraded+remapped
>              8    active+undersized+degraded+remapped+backfilling
>              6    active+clean+scrubbing+deep
>              3    activating+degraded+remapped
>              3    active+remapped+backfilling
>              1    active+undersized+remapped+backfill_wait
>
>   io:
>     client:   94 KiB/s rd, 94 op/s rd, 0 op/s wr
>     recovery: 112 MiB/s, 372 objects/s
>
>   progress:
>     Rebalancing after osd.20 marked in (10h)
>       [............................] (remaining: 11d)
>     Rebalancing after osd.41 marked in (10h)
>       [=...........................] (remaining: 8d)
>     Rebalancing after osd.30 marked in (10h)
>       [=...........................] (remaining: 9d)
>     Rebalancing after osd.1 marked in (10h)
>       [=======.....................] (remaining: 2h)
>     Rebalancing after osd.10 marked in (10h)
>       [............................] (remaining: 12d)
>     Rebalancing after osd.50 marked in (10h)
>       [............................] (remaining: 2w)
>     Rebalancing after osd.71 marked out (10h)
>       [==..........................] (remaining: 5d)
>
> What you may find interesting is the "slow ops" warnings. This is where
> our inactive PGs become stuck. Once the cluster gets into this state,
> I'm usually able to recover IO by restarting the OSDs with slow ops.
> However, what's extremely strange is that this workaround only works
> about 12 hours after the last OSD addition. Restarting the slow-ops
> OSDs before roughly 12 hours have passed results in the slow ops
> returning immediately.
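>
> In case it helps to see the exact steps, the restart workaround boils
> down to something like the following (osd.103 here is just one of the
> affected daemons from the health output above; the admin-socket dumps
> are only there to show the blocked ops, and the unit name assumes a
> plain systemd / ceph-ansible deployment, so adjust to your setup):
>
>     # on the host running osd.103: look at the ops currently stuck
>     ceph daemon osd.103 dump_ops_in_flight
>     ceph daemon osd.103 dump_historic_ops
>
>     # bounce the daemon to clear the slow ops
>     systemctl restart ceph-osd@103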
>
> Our first thought was hardware issues, however we ruled this out after
> the slow ops warnings appeared on brand-new HDDs and OSD hosts.
> Monitoring the IO saturation of the OSDs reporting slow ops shows
> actual usage nowhere near saturation, and no hardware issues are
> present on the drives themselves.
>
> Looking at the journalctl logs of one of the affected OSDs above, we
> see the following repeated multiple times:
>
>     osd.103 56934 get_health_metrics reporting 2 slow ops, oldest is
>     osd_op(client.467952.0:1520304537 8.6fbs0 8.1e6826fb (undecoded)
>     ondisk+retry+write+known_if_redirected e56923
>
> So far my procedure for the disk swaps has been as follows:
>
> 1. Set noout, norebalance, and norecover on the cluster.
> 2. Use ceph-ansible to remove the old OSD IDs.
> 3. Swap the physical HDDs and re-add them with ceph-ansible.
> 4. Unset noout, norebalance, and norecover.
>
> I should note this issue appears even with simple OSD additions (not
> removals): we added 2 brand-new hosts to the cluster and saw the same
> issue.
>
> I've been trying to think of any possible cause of this issue. I should
> mention that our cluster is messy hardware-wise at the moment (we have
> a mix of 7T, 4T, and 10T HDDs -- we are moving to all 10T HDDs, but the
> swap process has been taking a while). One warning I've noticed during
> the old disk removals is "too many PGs per OSD", however that warning
> clears once the new OSDs are added, which I assume is to be expected.
>
> If anyone would be willing to provide any hints on where to look, it
> would be much appreciated!
>
> Thanks for your time.
> --
>
> Justin Goetz
> Systems Engineer, TeraSwitch Inc.
> jgoetz@xxxxxxxxxxxxxx
> 412-945-7045 (NOC) | 412-459-7945 (Direct)
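P.S. A minimal sketch of how to apply the workaround above, assuming you
either manage ceph.conf yourself (e.g. via ceph-ansible) or run a release
with the centralized config database -- the exact scope and whether an OSD
restart is needed can vary by version:

    # ceph.conf on the OSD hosts
    [osd]
    osd max pg per osd hard ratio = 10.0

    # or, on releases with the centralized config database
    ceph config set osd osd_max_pg_per_osd_hard_ratio 10.0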