Dallas,

It looks to me like you will need to wait until the data movement naturally resolves the near-full issue. So long as you continue to see this:

  io:
    recovery: 477 KiB/s, 330 keys/s, 29 objects/s

the cluster is working. That said, there are some things you can do:

1) The near-full ratio is configurable. I don't have those commands immediately to hand, but Googling, or searching the archives of this list, should show you how to change this value from its default (0.85); example commands are sketched below. Make sure you set it back when the data movement is complete, or almost complete. Be careful with this: Ceph will happily fill up to the new near-full ratio and then error again. You also need to keep track of the other full ratios (there are two others: backfillfull and full).

2) Adjust performance settings to allow the data movement to go faster. Again, I don't have those settings immediately to hand, but Googling something like 'ceph recovery tuning', or searching this list, should point you in the right direction (again, see the examples below). Notice that you only have 6 PGs trying to move at a time, with 2 blocked on your near-full OSDs (8 & 19). I believe that, by default, each OSD daemon is only involved in one data movement at a time.

The tradeoff is that user activity suffers if you tune in favor of recovery; however, with the cluster in ERROR status, I suspect user activity is already suffering.
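To save you some searching, the commands I have in mind look roughly like the following. I'm writing these from memory, so please verify the option names and defaults against the Nautilus documentation before running anything; the specific ratio and backfill values shown are only illustrative, not recommendations for your cluster:

  # Show the ratios currently in effect
  ceph osd dump | grep ratio

  # 1) Temporarily raise the near-full warning threshold (default 0.85);
  #    set it back once the data movement is (nearly) complete
  ceph osd set-nearfull-ratio 0.90

  # The other two thresholds are backfillfull (default 0.90), which is what
  # produces backfill_toofull, and full (default 0.95); raise backfillfull
  # only slightly, and leave the full ratio alone if at all possible
  ceph osd set-backfillfull-ratio 0.92

  # 2) Allow each OSD to take part in more concurrent backfills
  #    (defaults: osd_max_backfills=1, osd_recovery_max_active=3)
  ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 4'

  # Revert once the cluster has rebalanced
  ceph osd set-nearfull-ratio 0.85
  ceph osd set-backfillfull-ratio 0.90
  ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 3'

Note that injectargs only changes the running daemons and does not persist across OSD restarts, which is convenient here since you want the defaults back once the backfill finishes anyway.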
Thank you,

Dominic L. Hilsbos, MBA
Director – Information Technology
Perform Air International Inc.
DHilsbos@xxxxxxxxxxxxxx
www.PerformAir.com


-----Original Message-----
From: Dallas Jones [mailto:djones@xxxxxxxxxxxxxxxxx]
Sent: Thursday, August 27, 2020 9:02 AM
To: ceph-users@xxxxxxx
Subject: Re: Cluster degraded after adding OSDs to increase capacity

The new drives are larger capacity than the first drives I added to the cluster, but they're all SAS HDDs.

cephuser@ceph01:~$ ceph osd df tree
 ID CLASS    WEIGHT REWEIGHT    SIZE RAW USE    DATA    OMAP    META    AVAIL  %USE  VAR PGS STATUS TYPE NAME
 -1       122.79410        - 123 TiB  42 TiB  41 TiB 217 GiB 466 GiB   81 TiB 33.86 1.00   -        root default
 -3        40.93137        -  41 TiB  14 TiB  14 TiB  72 GiB 154 GiB   27 TiB 33.86 1.00   -            host ceph01
  0   hdd   2.72849  0.95001 2.7 TiB 2.2 TiB 2.1 TiB 7.4 GiB  24 GiB  569 GiB 79.64 2.35 218     up         osd.0
  1   hdd   2.72849  1.00000 2.7 TiB 2.1 TiB 2.0 TiB 7.6 GiB  23 GiB  694 GiB 75.16 2.22 196     up         osd.1
  2   hdd   2.72849  1.00000 2.7 TiB 1.6 TiB 1.6 TiB 8.8 GiB  18 GiB  1.1 TiB 60.39 1.78 199     up         osd.2
  3   hdd   2.72849  0.95001 2.7 TiB 2.2 TiB 2.1 TiB 8.3 GiB  23 GiB  583 GiB 79.13 2.34 202     up         osd.3
  4   hdd   2.72849  1.00000 2.7 TiB 2.1 TiB 2.0 TiB 8.4 GiB  22 GiB  692 GiB 75.22 2.22 214     up         osd.4
  5   hdd   2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB 8.5 GiB  19 GiB  1.0 TiB 62.39 1.84 195     up         osd.5
  6   hdd   2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB 8.5 GiB  21 GiB  709 GiB 74.62 2.20 217     up         osd.6
 22   hdd   5.45799  1.00000 5.5 TiB 4.2 GiB 165 MiB 2.0 GiB 2.1 GiB  5.5 TiB  0.08 0.00  23     up         osd.22
 23   hdd   5.45799  1.00000 5.5 TiB 2.7 GiB 161 MiB 1.5 GiB 1.0 GiB  5.5 TiB  0.05 0.00  23     up         osd.23
 27   hdd   5.45799  1.00000 5.5 TiB  23 GiB  17 GiB 5.0 GiB 1.3 GiB  5.4 TiB  0.42 0.01  63     up         osd.27
 28   hdd   5.45799  1.00000 5.5 TiB  10 GiB 2.8 GiB 6.0 GiB 1.3 GiB  5.4 TiB  0.18 0.01  82     up         osd.28
 -5        40.93137        -  41 TiB  14 TiB  14 TiB  71 GiB 157 GiB   27 TiB 33.89 1.00   -            host ceph02
  7   hdd   2.72849  1.00000 2.7 TiB 2.1 TiB 2.1 TiB 9.6 GiB  23 GiB  652 GiB 76.66 2.26 221     up         osd.7
  8   hdd   2.72849  0.95001 2.7 TiB 2.4 TiB 2.4 TiB 7.6 GiB  26 GiB  308 GiB 88.98 2.63 220     up         osd.8
  9   hdd   2.72849  1.00000 2.7 TiB 2.1 TiB 2.0 TiB 8.5 GiB  23 GiB  679 GiB 75.71 2.24 214     up         osd.9
 10   hdd   2.72849  1.00000 2.7 TiB 2.0 TiB 1.9 TiB 7.5 GiB  21 GiB  777 GiB 72.18 2.13 208     up         osd.10
 11   hdd   2.72849  1.00000 2.7 TiB 2.0 TiB 2.0 TiB 6.1 GiB  22 GiB  752 GiB 73.10 2.16 191     up         osd.11
 12   hdd   2.72849  1.00000 2.7 TiB 1.5 TiB 1.5 TiB 9.1 GiB  18 GiB  1.2 TiB 56.45 1.67 188     up         osd.12
 13   hdd   2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB 7.9 GiB  19 GiB 1024 GiB 63.37 1.87 193     up         osd.13
 25   hdd   5.45799  1.00000 5.5 TiB 4.9 GiB 165 MiB 3.7 GiB 1.0 GiB  5.5 TiB  0.09 0.00  42     up         osd.25
 26   hdd   5.45799  1.00000 5.5 TiB 2.9 GiB 157 MiB 1.6 GiB 1.2 GiB  5.5 TiB  0.05 0.00  26     up         osd.26
 29   hdd   5.45799  1.00000 5.5 TiB  24 GiB  18 GiB 4.2 GiB 1.2 GiB  5.4 TiB  0.43 0.01  58     up         osd.29
 30   hdd   5.45799  1.00000 5.5 TiB  21 GiB  14 GiB 5.6 GiB 1.3 GiB  5.4 TiB  0.38 0.01  71     up         osd.30
 -7        40.93137        -  41 TiB  14 TiB  14 TiB  73 GiB 156 GiB   27 TiB 33.83 1.00   -            host ceph03
 14   hdd   2.72849  1.00000 2.7 TiB 2.1 TiB 2.1 TiB 6.9 GiB  23 GiB  627 GiB 77.56 2.29 205     up         osd.14
 15   hdd   2.72849  1.00000 2.7 TiB 2.0 TiB 1.9 TiB 6.8 GiB  21 GiB  793 GiB 71.62 2.12 189     up         osd.15
 16   hdd   2.72849  1.00000 2.7 TiB 1.9 TiB 1.9 TiB 8.7 GiB  21 GiB  813 GiB 70.89 2.09 209     up         osd.16
 17   hdd   2.72849  1.00000 2.7 TiB 2.1 TiB 2.1 TiB 8.6 GiB  23 GiB  609 GiB 78.19 2.31 216     up         osd.17
 18   hdd   2.72849  1.00000 2.7 TiB 1.7 TiB 1.7 TiB 9.1 GiB  19 GiB  1.0 TiB 62.40 1.84 209     up         osd.18
 19   hdd   2.72849  0.95001 2.7 TiB 2.2 TiB 2.2 TiB 9.1 GiB  24 GiB  541 GiB 80.65 2.38 210     up         osd.19
 20   hdd   2.72849  1.00000 2.7 TiB 1.8 TiB 1.8 TiB 8.4 GiB  19 GiB  969 GiB 65.32 1.93 200     up         osd.20
 21   hdd   5.45799  1.00000 5.5 TiB 3.7 GiB 161 MiB 2.2 GiB 1.3 GiB  5.5 TiB  0.07 0.00  28     up         osd.21
 24   hdd   5.45799  1.00000 5.5 TiB 4.9 GiB 177 MiB 3.6 GiB 1.1 GiB  5.5 TiB  0.09 0.00  37     up         osd.24
 31   hdd   5.45799  1.00000 5.5 TiB 8.9 GiB 2.7 GiB 5.0 GiB 1.2 GiB  5.4 TiB  0.16 0.00  59     up         osd.31
 32   hdd   5.45799  1.00000 5.5 TiB 6.0 GiB 182 MiB 4.7 GiB 1.1 GiB  5.5 TiB  0.11 0.00  70     up         osd.32
                             TOTAL 123 TiB  42 TiB  41 TiB 217 GiB 466 GiB   81 TiB 33.86
MIN/MAX VAR: 0.00/2.63  STDDEV: 37.27

On Thu, Aug 27, 2020 at 8:43 AM Eugen Block <eblock@xxxxxx> wrote:

> Hi,
>
> are the new OSDs in the same root and is it the same device class? Can
> you share the output of ‚ceph osd df tree‘?
>
>
> Quoting Dallas Jones <djones@xxxxxxxxxxxxxxxxx>:
>
> > My 3-node Ceph cluster (14.2.4) has been running fine for months.
> > However, my data pool became close to full a couple of weeks ago, so I
> > added 12 new OSDs, roughly doubling the capacity of the cluster.
> > However, the pool size has not changed, and the health of the cluster
> > has changed for the worse. The dashboard shows the following cluster
> > status:
> >
> > - PG_DEGRADED_FULL: Degraded data redundancy (low space): 2 pgs
> >   backfill_toofull
> > - POOL_NEARFULL: 6 pool(s) nearfull
> > - OSD_NEARFULL: 1 nearfull osd(s)
> >
> > Output from ceph -s:
> >
> >   cluster:
> >     id:     e5a47160-a302-462a-8fa4-1e533e1edd4e
> >     health: HEALTH_ERR
> >             1 nearfull osd(s)
> >             6 pool(s) nearfull
> >             Degraded data redundancy (low space): 2 pgs backfill_toofull
> >
> >   services:
> >     mon: 3 daemons, quorum ceph01,ceph02,ceph03 (age 5w)
> >     mgr: ceph01(active, since 4w), standbys: ceph03, ceph02
> >     mds: cephfs:1 {0=ceph01=up:active} 2 up:standby
> >     osd: 33 osds: 33 up (since 43h), 33 in (since 43h); 1094 remapped pgs
> >     rgw: 3 daemons active (ceph01, ceph02, ceph03)
> >
> >   data:
> >     pools:   6 pools, 1632 pgs
> >     objects: 134.50M objects, 7.8 TiB
> >     usage:   42 TiB used, 81 TiB / 123 TiB avail
> >     pgs:     213786007/403501920 objects misplaced (52.983%)
> >              1088 active+remapped+backfill_wait
> >              538  active+clean
> >              4    active+remapped+backfilling
> >              2    active+remapped+backfill_wait+backfill_toofull
> >
> >   io:
> >     recovery: 477 KiB/s, 330 keys/s, 29 objects/s
> >
> > Can someone steer me in the right direction for how to get my cluster
> > healthy again?
> >
> > Thanks in advance!
> >
> > -Dallas

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx