Hi,

5% is the default target ratio for misplaced PGs whenever any automated rebalancing happens; the two sources of that rebalancing are the balancer and PG scaling. So I'd suspect there is a PG count change ongoing (either the pg autoscaler or a manual change; both obey the target misplaced ratio). You can check this by running "ceph osd pool ls detail" and looking at the pg_num target value.

Also: it looks like you've set osd_scrub_during_recovery = false. That setting can be annoying on large erasure-coded setups on HDDs that see long recovery times; it's better to get the IO priorities right instead. Search the mailing list archives for "osd op queue cut off high".
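Roughly what I'd run to check, from memory, so treat it as a sketch rather than gospel: the option and field names (target_max_misplaced_ratio, pg_num_target/pgp_num_target, osd_op_queue_cut_off) are the Nautilus-era ones I believe apply here, and osd.312 is just the OSD whose log you quoted.

    # the 5% comes from the mgr option target_max_misplaced_ratio (default 0.05),
    # which both the balancer and pg_num changes respect
    ceph config get mgr target_max_misplaced_ratio

    # a pool whose pg_num/pgp_num still differ from pg_num_target/pgp_num_target
    # has a PG change in flight, stepped at the misplaced-ratio pace
    ceph osd pool ls detail | grep pg_num

    # check the current cut-off on one OSD (run on its host), then raise it
    # cluster-wide; I believe the OSDs need a restart for it to take effect
    ceph daemon osd.312 config get osd_op_queue_cut_off
    ceph config set osd osd_op_queue_cut_off high

If pgp_num is still being stepped towards pgp_num_target there, that alone would explain the misplaced count hovering around 5%.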
Paul

On Mon, Sep 28, 2020 at 11:45 AM Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
> Dear All,
>
> After adding 10 new nodes, each with 10 OSDs to a cluster, we are unable
> to get "objects misplaced" back to zero.
>
> The cluster successfully re-balanced from ~35% to 5% misplaced, however
> every time "objects misplaced" drops below 5%, a number of pgs start to
> backfill, increasing the "objects misplaced" to 5.1%
>
> I do not believe the balancer is active:
>
> [root@ceph7 ceph]# ceph balancer status
> {
>     "last_optimize_duration": "",
>     "plans": [],
>     "mode": "upmap",
>     "active": false,
>     "optimize_result": "",
>     "last_optimize_started": ""
> }
>
> The cluster has now been stuck at ~5% misplaced for a couple of weeks.
> The recovery is using ~1GiB/s bandwidth, and is preventing any scrubs.
>
> The cluster contains 2.6PB of cephfs, that is still read/write usable.
> Cluster originally had 10 nodes, each with 45 8TB drives. The new nodes
> have 10 x 16TB drives.
>
> To show the cluster before and immediately after an "episode"
>
> ***************************************************
>
> [root@ceph7 ceph]# ceph -s
>   cluster:
>     id:     36ed7113-080c-49b8-80e2-4947cc456f2a
>     health: HEALTH_WARN
>             7 nearfull osd(s)
>             2 pool(s) nearfull
>             Low space hindering backfill (add storage if this doesn't
>             resolve itself): 11 pgs backfill_toofull
>             16372 pgs not deep-scrubbed in time
>             16372 pgs not scrubbed in time
>             1/3 mons down, quorum ceph1b,ceph3b
>
>   services:
>     mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
>     mgr: ceph3(active, since 3d), standbys: ceph1
>     mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
>     osd: 554 osds: 554 up (since 4d), 554 in (since 5w); 848 remapped pgs
>
>   task status:
>     scrub status:
>         mds.ceph1: idle
>         mds.ceph2: idle
>
>   data:
>     pools:   3 pools, 16417 pgs
>     objects: 937.39M objects, 2.6 PiB
>     usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
>     pgs:     467620187/9352502650 objects misplaced (5.000%)
>              7893 active+clean
>              7294 active+clean+snaptrim_wait
>              785  active+remapped+backfill_wait
>              382  active+clean+snaptrim
>              52   active+remapped+backfilling
>              11   active+remapped+backfill_wait+backfill_toofull
>
>   io:
>     client:   129 KiB/s rd, 82 MiB/s wr, 3 op/s rd, 53 op/s wr
>     recovery: 1.1 GiB/s, 364 objects/s
>
> ***************************************************
>
> and then seconds later:
>
> ***************************************************
>
> [root@ceph7 ceph]# ceph -s
>   cluster:
>     id:     36ed7113-080c-49b8-80e2-4947cc456f2a
>     health: HEALTH_WARN
>             7 nearfull osd(s)
>             2 pool(s) nearfull
>             Low space hindering backfill (add storage if this doesn't
>             resolve itself): 11 pgs backfill_toofull
>             16372 pgs not deep-scrubbed in time
>             16372 pgs not scrubbed in time
>             1/3 mons down, quorum ceph1b,ceph3b
>
>   services:
>     mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
>     mgr: ceph3(active, since 3d), standbys: ceph1
>     mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
>     osd: 554 osds: 554 up (since 5d), 554 in (since 5w); 854 remapped pgs
>
>   task status:
>     scrub status:
>         mds.ceph1: idle
>         mds.ceph2: idle
>
>   data:
>     pools:   3 pools, 16417 pgs
>     objects: 937.40M objects, 2.6 PiB
>     usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
>     pgs:     470821753/9352518510 objects misplaced (5.034%)
>              7892 active+clean
>              7290 active+clean+snaptrim_wait
>              791  active+remapped+backfill_wait
>              381  active+clean+snaptrim
>              52   active+remapped+backfilling
>              11   active+remapped+backfill_wait+backfill_toofull
>
>   io:
>     client:   155 KiB/s rd, 125 MiB/s wr, 2 op/s rd, 53 op/s wr
>     recovery: 969 MiB/s, 330 objects/s
>
> ***************************************************
>
> If it helps, I've tried capturing 1/5 debug logs from an OSD.
>
> Not sure, but I think this is the way to follow a thread handling one pg
> as it decides to rebalance:
>
> [root@ceph7 ceph]# grep 7f2e569e9700 ceph-osd.312.log | less
>
> 2020-09-24 14:44:36.844 7f2e569e9700  1 osd.312 pg_epoch: 106808
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106808/106808)
> [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808
> pi=[102671,106808)/1 luod=0'0 crt=106803'6043528 lcod 106801'6043526 active
> mbc={} ps=104] start_peering_interval up
> [312,424,369,461,546,525,498,169,251,127] ->
> [148,508,398,457,256,533,137,469,357,306], acting
> [312,424,369,461,546,525,498,169,251,127] ->
> [148,508,398,457,256,533,137,469,357,306], acting_primary 312(0) -> 148,
> up_primary 312(0) -> 148, role 0 -> -1, features acting 4611087854031667199
> upacting 4611087854031667199
>
> 2020-09-24 14:44:36.847 7f2e569e9700  1 osd.312 pg_epoch: 106808
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106808/106808)
> [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808
> pi=[102671,106808)/1 crt=106803'6043528 lcod 106801'6043526 unknown
> NOTIFY mbc={} ps=104] state<Start>: transitioning to Stray
>
> 2020-09-24 14:44:37.792 7f2e569e9700  1 osd.312 pg_epoch: 106809
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106809/106809)
> [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0)
> r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod
> 106801'6043526 mlcod 0'0 remapped NOTIFY mbc={} ps=104]
> start_peering_interval up [148,508,398,457,256,533,137,469,357,306] ->
> [148,508,398,457,256,533,137,469,357,306], acting
> [148,508,398,457,256,533,137,469,357,306] ->
> [312,424,369,461,546,525,498,169,251,127], acting_primary 148(0) -> 312,
> up_primary 148(0) -> 148, role -1 -> 0, features acting
> 4611087854031667199 upacting 4611087854031667199
>
> 2020-09-24 14:44:37.793 7f2e569e9700  1 osd.312 pg_epoch: 106809
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106809/106809)
> [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0)
> r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod
> 106801'6043526 mlcod 0'0 remapped mbc={} ps=104] state<Start>:
> transitioning to Primary
>
> 2020-09-24 14:44:38.832 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.137(6) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.861 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.148(0) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.879 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.256(4) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.894 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.306(9) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.902 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.357(8) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.912 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.398(2) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.923 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.457(3) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.931 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.469(7) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.938 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.508(1) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.947 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.533(5) from (0'0,0'0] MAX to
> 106803'6043528
>
> ***************************************************
>
> any advice appreciated,
>
> Jake
>
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx