Dear All,

After adding 10 new nodes (each with 10 OSDs) to a cluster, we are unable to get "objects misplaced" back to zero. The cluster successfully rebalanced from ~35% to 5% misplaced; however, every time "objects misplaced" drops below 5%, a number of pgs start to backfill, pushing "objects misplaced" back up to 5.1%.

I do not believe the balancer is active:

[root@ceph7 ceph]# ceph balancer status
{
    "last_optimize_duration": "",
    "plans": [],
    "mode": "upmap",
    "active": false,
    "optimize_result": "",
    "last_optimize_started": ""
}

The cluster has now been stuck at ~5% misplaced for a couple of weeks. Recovery is using ~1 GiB/s of bandwidth and is preventing any scrubs. The cluster contains 2.6 PiB of cephfs, which is still usable for reads and writes.

The cluster originally had 10 nodes, each with 45 x 8 TB drives. The new nodes each have 10 x 16 TB drives.

To show the cluster before and immediately after an "episode":

***************************************************
[root@ceph7 ceph]# ceph -s
  cluster:
    id:     36ed7113-080c-49b8-80e2-4947cc456f2a
    health: HEALTH_WARN
            7 nearfull osd(s)
            2 pool(s) nearfull
            Low space hindering backfill (add storage if this doesn't resolve itself): 11 pgs backfill_toofull
            16372 pgs not deep-scrubbed in time
            16372 pgs not scrubbed in time
            1/3 mons down, quorum ceph1b,ceph3b

  services:
    mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
    mgr: ceph3(active, since 3d), standbys: ceph1
    mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
    osd: 554 osds: 554 up (since 4d), 554 in (since 5w); 848 remapped pgs

  task status:
    scrub status:
        mds.ceph1: idle
        mds.ceph2: idle

  data:
    pools:   3 pools, 16417 pgs
    objects: 937.39M objects, 2.6 PiB
    usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
    pgs:     467620187/9352502650 objects misplaced (5.000%)
             7893 active+clean
             7294 active+clean+snaptrim_wait
             785  active+remapped+backfill_wait
             382  active+clean+snaptrim
             52   active+remapped+backfilling
             11   active+remapped+backfill_wait+backfill_toofull

  io:
    client:   129 KiB/s rd, 82 MiB/s wr, 3 op/s rd, 53 op/s wr
    recovery: 1.1 GiB/s, 364 objects/s
***************************************************

and then seconds later:

***************************************************
[root@ceph7 ceph]# ceph -s
  cluster:
    id:     36ed7113-080c-49b8-80e2-4947cc456f2a
    health: HEALTH_WARN
            7 nearfull osd(s)
            2 pool(s) nearfull
            Low space hindering backfill (add storage if this doesn't resolve itself): 11 pgs backfill_toofull
            16372 pgs not deep-scrubbed in time
            16372 pgs not scrubbed in time
            1/3 mons down, quorum ceph1b,ceph3b

  services:
    mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
    mgr: ceph3(active, since 3d), standbys: ceph1
    mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
    osd: 554 osds: 554 up (since 5d), 554 in (since 5w); 854 remapped pgs

  task status:
    scrub status:
        mds.ceph1: idle
        mds.ceph2: idle

  data:
    pools:   3 pools, 16417 pgs
    objects: 937.40M objects, 2.6 PiB
    usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
    pgs:     470821753/9352518510 objects misplaced (5.034%)
             7892 active+clean
             7290 active+clean+snaptrim_wait
             791  active+remapped+backfill_wait
             381  active+clean+snaptrim
             52   active+remapped+backfilling
             11   active+remapped+backfill_wait+backfill_toofull

  io:
    client:   155 KiB/s rd, 125 MiB/s wr, 2 op/s rd, 53 op/s wr
    recovery: 969 MiB/s, 330 objects/s
***************************************************

If it helps, I've tried capturing 1/5 debug logs from an OSD.
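In case it matters, the logging on that OSD was set roughly like this (from memory, so treat it as a sketch rather than an exact transcript; osd.312 is simply the daemon I picked):

# on the node hosting osd.312: set debug_osd to 1/5 via the admin socket,
# then watch the log while waiting for the next "episode"
ceph daemon osd.312 config set debug_osd 1/5
tail -f /var/log/ceph/ceph-osd.312.log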
Not sure, but I think this is the way to follow a thread handling one pg as it decides to rebalance:

[root@ceph7 ceph]# grep 7f2e569e9700 ceph-osd.312.log | less

2020-09-24 14:44:36.844 7f2e569e9700 1 osd.312 pg_epoch: 106808 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106808/106808) [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808 pi=[102671,106808)/1 luod=0'0 crt=106803'6043528 lcod 106801'6043526 active mbc={} ps=104] start_peering_interval up [312,424,369,461,546,525,498,169,251,127] -> [148,508,398,457,256,533,137,469,357,306], acting [312,424,369,461,546,525,498,169,251,127] -> [148,508,398,457,256,533,137,469,357,306], acting_primary 312(0) -> 148, up_primary 312(0) -> 148, role 0 -> -1, features acting 4611087854031667199 upacting 4611087854031667199

2020-09-24 14:44:36.847 7f2e569e9700 1 osd.312 pg_epoch: 106808 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106808/106808) [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808 pi=[102671,106808)/1 crt=106803'6043528 lcod 106801'6043526 unknown NOTIFY mbc={} ps=104] state<Start>: transitioning to Stray

2020-09-24 14:44:37.792 7f2e569e9700 1 osd.312 pg_epoch: 106809 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106809/106809) [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0) r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod 106801'6043526 mlcod 0'0 remapped NOTIFY mbc={} ps=104] start_peering_interval up [148,508,398,457,256,533,137,469,357,306] -> [148,508,398,457,256,533,137,469,357,306], acting [148,508,398,457,256,533,137,469,357,306] -> [312,424,369,461,546,525,498,169,251,127], acting_primary 148(0) -> 312, up_primary 148(0) -> 148, role -1 -> 0, features acting 4611087854031667199 upacting 4611087854031667199

2020-09-24 14:44:37.793 7f2e569e9700 1 osd.312 pg_epoch: 106809 pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528] local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671 les/c/f 102672/102672/0 106808/106809/106809) [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0) r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod 106801'6043526 mlcod 0'0 remapped mbc={} ps=104] state<Start>: transitioning to Primary

2020-09-24 14:44:38.832 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.137(6) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.861 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.148(0) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.879 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.256(4) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.894 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.306(9) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.902 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.357(8) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.912 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.398(2) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.923 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.457(3) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.931 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.469(7) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.938 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.508(1) from (0'0,0'0] MAX to 106803'6043528
2020-09-24 14:44:38.947 7f2e569e9700 0 log_channel(cluster) log [DBG] : 5.157ds0 starting backfill to osd.533(5) from (0'0,0'0] MAX to 106803'6043528
***************************************************
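The same log can also be sliced by pg rather than by thread id; this is roughly what I have been doing to see whether the same pgs keep being re-queued (again, just a sketch):

# follow the pg above by its id instead of by the thread id
grep 5.157ds0 ceph-osd.312.log | less

# count how many times each pg on this OSD has logged "starting backfill"
grep 'starting backfill to' ceph-osd.312.log | awk '{print $9}' | sort | uniq -c | sort -rn | head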
any advice appreciated,

Jake

--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue, Cambridge CB2 0QH, UK.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx