Hi,

5% is the default target ratio for misplaced PGs whenever any automated rebalancing happens; the two sources of that rebalancing are the balancer and PG scaling. So I'd suspect there is a PG count change ongoing (either the pg autoscaler or a manual change; both obey the target misplaced ratio). You can check this by running "ceph osd pool ls detail" and looking at the pg_num target value.

Also: it looks like you've set osd_scrub_during_recovery = false. That setting can be annoying on large erasure-coded setups on HDDs that see long recovery times; it's better to get the IO priorities right instead. Search the mailing list archives for "osd op queue cut off high".
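Roughly what I'd run to check, from memory, so treat it as a sketch rather than gospel: the option and field names (target_max_misplaced_ratio, pg_num_target/pgp_num_target, osd_op_queue_cut_off) are the Nautilus-era ones I believe apply here, and osd.312 is just the OSD whose log you quoted.

    # the 5% comes from the mgr option target_max_misplaced_ratio (default 0.05),
    # which both the balancer and pg_num changes respect
    ceph config get mgr target_max_misplaced_ratio

    # a pool whose pg_num/pgp_num still differ from pg_num_target/pgp_num_target
    # has a PG change in flight, stepped at the misplaced-ratio pace
    ceph osd pool ls detail | grep pg_num

    # check the current cut-off on one OSD (run on its host), then raise it
    # cluster-wide; I believe the OSDs need a restart for it to take effect
    ceph daemon osd.312 config get osd_op_queue_cut_off
    ceph config set osd osd_op_queue_cut_off high

If pgp_num is still being stepped towards pgp_num_target there, that alone would explain the misplaced count hovering around 5%.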
Paul

On Mon, Sep 28, 2020 at 11:45 AM Jake Grimmett <jog@xxxxxxxxxxxxxxxxx> wrote:
> Dear All,
>
> After adding 10 new nodes, each with 10 OSDs to a cluster, we are unable
> to get "objects misplaced" back to zero.
>
> The cluster successfully re-balanced from ~35% to 5% misplaced, however
> every time "objects misplaced" drops below 5%, a number of pgs start to
> backfill, increasing the "objects misplaced" to 5.1%
>
> I do not believe the balancer is active:
>
> [root@ceph7 ceph]# ceph balancer status
> {
>     "last_optimize_duration": "",
>     "plans": [],
>     "mode": "upmap",
>     "active": false,
>     "optimize_result": "",
>     "last_optimize_started": ""
> }
>
> The cluster has now been stuck at ~5% misplaced for a couple of weeks.
> The recovery is using ~1GiB/s bandwidth, and is preventing any scrubs.
>
> The cluster contains 2.6PB of cephfs, that is still read/write usable.
> Cluster originally had 10 nodes, each with 45 8TB drives. The new nodes
> have 10 x 16TB drives.
>
> To show the cluster before and immediately after an "episode"
>
> ***************************************************
>
> [root@ceph7 ceph]# ceph -s
>   cluster:
>     id:     36ed7113-080c-49b8-80e2-4947cc456f2a
>     health: HEALTH_WARN
>             7 nearfull osd(s)
>             2 pool(s) nearfull
>             Low space hindering backfill (add storage if this doesn't
>             resolve itself): 11 pgs backfill_toofull
>             16372 pgs not deep-scrubbed in time
>             16372 pgs not scrubbed in time
>             1/3 mons down, quorum ceph1b,ceph3b
>
>   services:
>     mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
>     mgr: ceph3(active, since 3d), standbys: ceph1
>     mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
>     osd: 554 osds: 554 up (since 4d), 554 in (since 5w); 848 remapped pgs
>
>   task status:
>     scrub status:
>         mds.ceph1: idle
>         mds.ceph2: idle
>
>   data:
>     pools:   3 pools, 16417 pgs
>     objects: 937.39M objects, 2.6 PiB
>     usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
>     pgs:     467620187/9352502650 objects misplaced (5.000%)
>              7893 active+clean
>              7294 active+clean+snaptrim_wait
>              785  active+remapped+backfill_wait
>              382  active+clean+snaptrim
>              52   active+remapped+backfilling
>              11   active+remapped+backfill_wait+backfill_toofull
>
>   io:
>     client:   129 KiB/s rd, 82 MiB/s wr, 3 op/s rd, 53 op/s wr
>     recovery: 1.1 GiB/s, 364 objects/s
>
> ***************************************************
>
> and then seconds later:
>
> ***************************************************
>
> [root@ceph7 ceph]# ceph -s
>   cluster:
>     id:     36ed7113-080c-49b8-80e2-4947cc456f2a
>     health: HEALTH_WARN
>             7 nearfull osd(s)
>             2 pool(s) nearfull
>             Low space hindering backfill (add storage if this doesn't
>             resolve itself): 11 pgs backfill_toofull
>             16372 pgs not deep-scrubbed in time
>             16372 pgs not scrubbed in time
>             1/3 mons down, quorum ceph1b,ceph3b
>
>   services:
>     mon: 3 daemons, quorum ceph1b,ceph3b (age 6d), out of quorum: ceph2b
>     mgr: ceph3(active, since 3d), standbys: ceph1
>     mds: cephfs:1 {0=ceph1=up:active} 1 up:standby-replay
>     osd: 554 osds: 554 up (since 5d), 554 in (since 5w); 854 remapped pgs
>
>   task status:
>     scrub status:
>         mds.ceph1: idle
>         mds.ceph2: idle
>
>   data:
>     pools:   3 pools, 16417 pgs
>     objects: 937.40M objects, 2.6 PiB
>     usage:   3.2 PiB used, 1.4 PiB / 4.6 PiB avail
>     pgs:     470821753/9352518510 objects misplaced (5.034%)
>              7892 active+clean
>              7290 active+clean+snaptrim_wait
>              791  active+remapped+backfill_wait
>              381  active+clean+snaptrim
>              52   active+remapped+backfilling
>              11   active+remapped+backfill_wait+backfill_toofull
>
>   io:
>     client:   155 KiB/s rd, 125 MiB/s wr, 2 op/s rd, 53 op/s wr
>     recovery: 969 MiB/s, 330 objects/s
>
> ***************************************************
>
> If it helps, I've tried capturing 1/5 debug logs from an OSD.
>
> Not sure, but I think this is the way to follow a thread handling one pg
> as it decides to rebalance:
>
> [root@ceph7 ceph]# grep 7f2e569e9700 ceph-osd.312.log | less
>
> 2020-09-24 14:44:36.844 7f2e569e9700  1 osd.312 pg_epoch: 106808
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106808/106808)
> [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808
> pi=[102671,106808)/1 luod=0'0 crt=106803'6043528 lcod 106801'6043526 active
> mbc={} ps=104] start_peering_interval up
> [312,424,369,461,546,525,498,169,251,127] ->
> [148,508,398,457,256,533,137,469,357,306], acting
> [312,424,369,461,546,525,498,169,251,127] ->
> [148,508,398,457,256,533,137,469,357,306], acting_primary 312(0) -> 148,
> up_primary 312(0) -> 148, role 0 -> -1, features acting 4611087854031667199
> upacting 4611087854031667199
>
> 2020-09-24 14:44:36.847 7f2e569e9700  1 osd.312 pg_epoch: 106808
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106808/106808)
> [148,508,398,457,256,533,137,469,357,306]p148(0) r=-1 lpr=106808
> pi=[102671,106808)/1 crt=106803'6043528 lcod 106801'6043526 unknown
> NOTIFY mbc={} ps=104] state<Start>: transitioning to Stray
>
> 2020-09-24 14:44:37.792 7f2e569e9700  1 osd.312 pg_epoch: 106809
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106809/106809)
> [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0)
> r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod
> 106801'6043526 mlcod 0'0 remapped NOTIFY mbc={} ps=104]
> start_peering_interval up [148,508,398,457,256,533,137,469,357,306] ->
> [148,508,398,457,256,533,137,469,357,306], acting
> [148,508,398,457,256,533,137,469,357,306] ->
> [312,424,369,461,546,525,498,169,251,127], acting_primary 148(0) -> 312,
> up_primary 148(0) -> 148, role -1 -> 0, features acting
> 4611087854031667199 upacting 4611087854031667199
>
> 2020-09-24 14:44:37.793 7f2e569e9700  1 osd.312 pg_epoch: 106809
> pg[5.157ds0( v 106803'6043528 (103919'6040524,106803'6043528]
> local-lis/les=102671/102672 n=56293 ec=85890/1818 lis/c 102671/102671
> les/c/f 102672/102672/0 106808/106809/106809)
> [148,508,398,457,256,533,137,469,357,306]/[312,424,369,461,546,525,498,169,251,127]p312(0)
> r=0 lpr=106809 pi=[102671,106809)/1 crt=106803'6043528 lcod
> 106801'6043526 mlcod 0'0 remapped mbc={} ps=104] state<Start>:
> transitioning to Primary
>
> 2020-09-24 14:44:38.832 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.137(6) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.861 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.148(0) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.879 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.256(4) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.894 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.306(9) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.902 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.357(8) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.912 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.398(2) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.923 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.457(3) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.931 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.469(7) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.938 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.508(1) from (0'0,0'0] MAX to
> 106803'6043528
> 2020-09-24 14:44:38.947 7f2e569e9700  0 log_channel(cluster) log [DBG] :
> 5.157ds0 starting backfill to osd.533(5) from (0'0,0'0] MAX to
> 106803'6043528
>
> ***************************************************
>
> any advice appreciated,
>
> Jake
>
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx