Dear All,

After adding 10 new nodes (each with 10 OSDs) to a cluster, we are unable to get "objects misplaced" back to zero. The cluster successfully rebalanced from ~35% to 5% misplaced; however, every time "objects misplaced" drops below 5%, a number of pgs start to backfill, pushing "objects misplaced" back up to 5.1%.

I do not believe the balancer is active:

[root@ceph7 ceph]# ceph balancer status
{
    "last_optimize_duration": "",
    "plans": [],
    "mode": "upmap",
    "active": false,
    "optimize_result": "",
    "last_optimize_started": ""
}

The cluster has now been stuck at ~5% misplaced for a couple of weeks. Recovery is using ~1 GiB/s of bandwidth and is preventing any scrubs. The cluster contains 2.6 PiB of cephfs, which is still usable for reads and writes.

The cluster originally had 10 nodes, each with 45 x 8 TB drives. The new nodes each have 10 x 16 TB drives.

To show the cluster before and immediately after an "episode":

***************************************************
[root@ceph7 ceph]# ceph -s
  cluster:
    id:     36ed7113-080c-49b8-80e2-4947cc456f2a
    health: HEALTH_WARN
            7 nearfull osd(s)
            2 pool(s) nearfull
            Low space hindering backfill (add storage if this doesn't resolve itself): 11 pgs backfill_toofull
            16372 pgs not deep-scrubbed in time
            16372 pgs not scrubbed in time
            1/3 mons down, quorum ceph1b,ceph3b

  services:
    mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
    mgr: ceph3(active, since 3d), standbys: ceph1
    mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
    osd: 554 osds: 554 up (since 4d), 554 in (since 5w); 848 remapped pgs

  task status:
    scrub status:
        mds.ceph1: idle
        mds.ceph2: idle

  data:
    pools:   3 pools, 16417 pgs
    objects: 937.39M objects, 2.6 PiB
    usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
    pgs:     467620187/9352502650 objects misplaced (5.000%)
             7893 active+clean
             7294 active+clean+snaptrim_wait
             785  active+remapped+backfill_wait
             382  active+clean+snaptrim
             52   active+remapped+backfilling
             11   active+remapped+backfill_wait+backfill_toofull

  io:
    client:   129 KiB/s rd, 82 MiB/s wr, 3 op/s rd, 53 op/s wr
    recovery: 1.1 GiB/s, 364 objects/s
***************************************************

and then seconds later:

***************************************************
[root@ceph7 ceph]# ceph -s
  cluster:
    id:     36ed7113-080c-49b8-80e2-4947cc456f2a
    health: HEALTH_WARN
            7 nearfull osd(s)
            2 pool(s) nearfull
            Low space hindering backfill (add storage if this doesn't resolve itself): 11 pgs backfill_toofull
            16372 pgs not deep-scrubbed in time
            16372 pgs not scrubbed in time
            1/3 mons down, quorum ceph1b,ceph3b

  services:
    mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
    mgr: ceph3(active, since 3d), standbys: ceph1
    mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
    osd: 554 osds: 554 up (since 5d), 554 in (since 5w); 854 remapped pgs

  task status:
    scrub status:
        mds.ceph1: idle
        mds.ceph2: idle

  data:
    pools:   3 pools, 16417 pgs
    objects: 937.40M objects, 2.6 PiB
    usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
    pgs:     470821753/9352518510 objects misplaced (5.034%)
             7892 active+clean
             7290 active+clean+snaptrim_wait
             791  active+remapped+backfill_wait
             381  active+clean+snaptrim
             52   active+remapped+backfilling
             11   active+remapped+backfill_wait+backfill_toofull

  io:
    client:   155 KiB/s rd, 125 MiB/s wr, 2 op/s rd, 53 op/s wr
    recovery: 969 MiB/s, 330 objects/s
***************************************************

If it helps, I've tried capturing 1/5 debug logs from an OSD.
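In case it matters, the logging on that OSD was set roughly like this (from memory, so treat it as a sketch rather than an exact transcript; osd.312 is simply the daemon I picked):

# on the node hosting osd.312: set debug_osd to 1/5 via the admin socket,
# then watch the log while waiting for the next "episode"
ceph daemon osd.312 config set debug_osd 1/5
tail -f /var/log/ceph/ceph-osd.312.log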
Not sure, but I think this is the way to follow a thread handling one pg as it decides to rebalance:

[root@ceph7 ceph]# grep 7f2e569e9700 ceph-osd.312.log | less

2020-09-24 14:44:36.844 7f2e569e9700 1 osd.312 pg_epoch: 106808 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106808/106808) [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808 pi=[102671,106808)/1 luod=0'0 crt=106803'6043528 lcod 106801'6043526 active mbc={} ps=104] start_peering_interval up [312,424,369,461,546,525,498,169,251,127] -> [148,508,398,457,256,533,137,469,357,306], acting [312,424,369,461,546,525,498,169,251,127] -> [148,508,398,457,256,533,137,469,357,306], acting_primary 312(0) -> 148, up_primary 312(0) -> 148, role 0 -> -1, features acting 4611087854031667199 upacting 4611087854031667199

2020-09-24 14:44:36.847 7f2e569e9700 1 osd.312 pg_epoch: 106808 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106808/106808) [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808 pi=[102671,106808)/1 crt=106803'6043528 lcod 106801'6043526 unknown NOTIFY mbc={} ps=104] state<Start>: transitioning to Stray

2020-09-24 14:44:37.792 7f2e569e9700 1 osd.312 pg_epoch: 106809 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106809/106809) [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0) r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod 106801'6043526 mlcod 0'0 remapped NOTIFY mbc={} ps=104] start_peering_interval up [148,508,398,457,256,533,137,469,357,306] -> [148,508,398,457,256,533,137,469,357,306], acting [148,508,398,457,256,533,137,469,357,306] -> [312,424,369,461,546,525,498,169,251,127], acting_primary 148(0) -> 312, up_primary 148(0) -> 148, role -1 -> 0, features acting 4611087854031667199 upacting 4611087854031667199

2020-09-24 14:44:37.793 7f2e569e9700 1 osd.312 pg_epoch: 106809 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106809/106809) [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0) r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod 106801'6043526 mlcod 0'0 remapped mbc={} ps=104] state<Start>: transitioning to Primary

2020-09-24 14:44:38.832 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.137(6) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.861 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.148(0) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.879 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.256(4) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.894 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.306(9) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.902 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.357(8) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.912 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.398(2) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.923 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.457(3) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.931 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.469(7) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.938 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.508(1) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.947 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.533(5) from (0'0,0'0] MAX to 106803'6043528
***************************************************
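The same log can also be sliced by pg rather than by thread id; this is roughly what I have been doing to see whether the same pgs keep being re-queued (again, just a sketch):

# follow the pg above by its id instead of by the thread id
grep 5.157ds0 ceph-osd.312.log | less

# count how many times each pg on this OSD has logged "starting backfill"
grep 'starting backfill to' ceph-osd.312.log | awk '{print $9}' | sort | uniq -c | sort -rn | head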
any advice appreciated,

Jake

--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge CB2 0QH, UK.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx