Re: strange backfill delay after outing one node

On 8/14/19 9:48 AM, Simon Oosthoek wrote:
> Hi all,
> 
> Yesterday I marked out all the OSDs on one node in our new cluster to
> reconfigure them with WAL/DB on their NVMe devices, but it is taking
> ages to rebalance. The whole cluster (and thus the OSDs) is only ~1%
> full, so the full ratio is nowhere in sight.
> 
> We have 14 OSD nodes with 12 disks each; one of them was marked out
> yesterday around noon. It has still not completed, and all the while
> the cluster is in ERROR state, even though this is a normal maintenance
> operation.
> 
> We are still experimenting with the cluster, and it remains operational
> while in ERROR state. However, it is slightly worrying to consider that
> this could take even (50x?) longer once the cluster holds 50x the
> amount of data. The OSDs are mostly flatlined in the dashboard graphs,
> so the backfill could potentially go much faster, I think.
> 
> Below are a few outputs of ceph -s / ceph -w:
> 
> Yesterday afternoon (~16:00)
> # ceph -w
>   cluster:
>     id:     b489547c-ba50-4745-a914-23eb78e0e5dc
>     health: HEALTH_ERR
>             Degraded data redundancy (low space): 139 pgs backfill_toofull
> 
>   services:
>     mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 4h)
>     mgr: cephmon1(active, since 4h), standbys: cephmon2, cephmon3
>     mds: cephfs:1 {0=cephmds1=up:active} 1 up:standby
>     osd: 168 osds: 168 up (since 3h), 156 in (since 3h); 1588 remapped pgs
>     rgw: 1 daemon active (cephs3.rgw0)
> 
>   data:
>     pools:   12 pools, 4116 pgs
>     objects: 14.04M objects, 11 TiB
>     usage:   20 TiB used, 1.7 PiB / 1.8 PiB avail
>     pgs:     16188696/109408503 objects misplaced (14.797%)
>              2528 active+clean
>              1422 active+remapped+backfill_wait
>              139  active+remapped+backfill_wait+backfill_toofull
>              27   active+remapped+backfilling
> 
>   io:
>     recovery: 205 MiB/s, 198 objects/s
> 
>   progress:
>     Rebalancing after osd.47 marked out
>       [=====================.........]
>     Rebalancing after osd.5 marked out
>       [===================...........]
>     Rebalancing after osd.132 marked out
>       [=====================.........]
>     Rebalancing after osd.90 marked out
>       [=====================.........]
>     Rebalancing after osd.76 marked out
>       [=====================.........]
>     Rebalancing after osd.157 marked out
>       [==================............]
>     Rebalancing after osd.19 marked out
>       [=====================.........]
>     Rebalancing after osd.118 marked out
>       [====================..........]
>     Rebalancing after osd.146 marked out
>       [=================.............]
>     Rebalancing after osd.104 marked out
>       [====================..........]
>     Rebalancing after osd.62 marked out
>       [=======================.......]
>     Rebalancing after osd.33 marked out
>       [======================........]
> 
> 
> This morning:
> # ceph -s
>   cluster:
>     id:     b489547c-ba50-4745-a914-23eb78e0e5dc
>     health: HEALTH_ERR
>             Degraded data redundancy (low space): 8 pgs backfill_toofull
> 
>   services:
>     mon: 3 daemons, quorum cephmon3,cephmon1,cephmon2 (age 22h)
>     mgr: cephmon1(active, since 22h), standbys: cephmon2, cephmon3
>     mds: cephfs:1 {0=cephmds2=up:active} 1 up:standby
>     osd: 168 osds: 168 up (since 22h), 156 in (since 21h); 189 remapped pgs
>     rgw: 1 daemon active (cephs3.rgw0)
> 
>   data:
>     pools:   12 pools, 4116 pgs
>     objects: 14.11M objects, 11 TiB
>     usage:   21 TiB used, 1.7 PiB / 1.8 PiB avail
>     pgs:     4643284/110159565 objects misplaced (4.215%)
>              3927 active+clean
>              162  active+remapped+backfill_wait
>              19   active+remapped+backfilling
>              8    active+remapped+backfill_wait+backfill_toofull
> 
>   io:
>     client:   32 KiB/s rd, 0 B/s wr, 31 op/s rd, 21 op/s wr
>     recovery: 198 MiB/s, 149 objects/s
> 

It seems it is still recovering, at 149 objects/second.
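
To keep an eye on the recovery rate over time, something simple like this
is usually enough (the grep pattern is just a convenience and may need
tweaking for your output):

# Stream cluster events, including recovery/backfill progress
$ ceph -w

# Or poll the status summary and show only the io section
$ watch -n 10 "ceph -s | grep -A 2 'io:'"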

>   progress:
>     Rebalancing after osd.47 marked out
>       [=============================.]
>     Rebalancing after osd.5 marked out
>       [=============================.]
>     Rebalancing after osd.132 marked out
>       [=============================.]
>     Rebalancing after osd.90 marked out
>       [=============================.]
>     Rebalancing after osd.76 marked out
>       [=============================.]
>     Rebalancing after osd.157 marked out
>       [=============================.]
>     Rebalancing after osd.19 marked out
>       [=============================.]
>     Rebalancing after osd.146 marked out
>       [=============================.]
>     Rebalancing after osd.104 marked out
>       [=============================.]
>     Rebalancing after osd.62 marked out
>       [=============================.]
> 
> 
> I found some hints at this URL, though I'm not sure they're right for us:
> https://forum.proxmox.com/threads/increase-ceph-recovery-speed.36728/
>> ceph tell 'osd.*' injectargs '--osd-max-backfills 16'
>> ceph tell 'osd.*' injectargs '--osd-recovery-max-active 4'
> 
> Since the cluster is currently hardly loaded, backfilling can take up
> all the unused bandwidth as far as I'm concerned...
> 
> Is it a good idea to run the above commands, or others, to speed up
> the backfilling (e.g. increasing "osd max backfills")?
> 

Yes; right now the OSDs aren't doing that many backfills, and you still
have a large queue of PGs waiting to be backfilled.
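
As a quick check of how big that queue still is, something like this
should work (a rough sketch; the exact pg dump output differs a bit
between releases):

# Count PGs still waiting to backfill
$ ceph pg dump pgs_brief 2>/dev/null | grep -c backfill_wait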

$ ceph tell 'osd.*' config set osd_max_backfills 5

The default is that only one (1) backfill runs at a time per OSD. Setting
it to 5 speeds up the process by increasing the concurrency. This will,
however, add load to the system and thus reduce the I/O available to
clients.
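
If you would rather set this cluster-wide through the monitors' config
database and revert it cleanly afterwards, something along these lines
should work on Mimic/Nautilus (a sketch, untested here; osd.0 is just an
example OSD to verify against):

# Raise backfill/recovery concurrency for all OSDs
$ ceph config set osd osd_max_backfills 5
$ ceph config set osd osd_recovery_max_active 4

# Verify what a running OSD actually picked up
$ ceph config show osd.0 | grep -e osd_max_backfills -e osd_recovery_max_active

# Once the backfill has finished, drop back to the defaults
$ ceph config rm osd osd_max_backfills
$ ceph config rm osd osd_recovery_max_active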

Wido

> Cheers
> 
> /Simon
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


