On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> When I lose a disk OR replace an OSD in my POC ceph cluster, it takes a
> very long time to rebalance. I should note that my cluster is slightly
> unique in that I am using cephfs (shouldn't matter?) and it currently
> contains about 310 million objects.
>
> The last time I replaced a disk/OSD was 2.5 days ago and it is still
> rebalancing. This is on a cluster with no client load.
>
> The configuration is 5 hosts, each with 6 x 1TB 7200rpm SATA OSDs and one
> 850 Pro SSD holding the journals for those OSDs. That means 30 OSDs in
> total. The system disk is on its own disk. I'm also using a backend
> network with a single Gb NIC. The rebalancing rate (objects/s) seems to
> be very slow when it is close to finishing... say <1% objects misplaced.
>
> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
> with no load on the cluster. Are my expectations off?

Possibly... Ceph basically needs to treat each object as a single IO. If
you're recovering from a failed disk then you've got to replicate roughly
310 million * 3 / 30 = 31 million object copies (the share that lived on
that disk). Even if that work is perfectly balanced across 30 disks that
each do about 80 IOPS, that's 31 million / 2400 = ~12,900 seconds (~3.5
hours) of work just to read each file. In reality it's likely to take more
than one IO to read each file, and then you have to spend a bunch of IO to
write it all back out as well.

> I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance
> time is dependent on the number of objects in the pool. These are
> thoughts I've had but am not certain are relevant here.

Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from its
default of 10... or you might not. That many small files isn't something
I've explored.
-Greg

> $ sudo ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> $ sudo ceph -s
> [sudo] password for bababurko:
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             5 pgs backfilling
>             5 pgs stuck unclean
>             recovery 3046506/676638611 objects misplaced (0.450%)
>      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>      osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>       pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
>             18319 GB used, 9612 GB / 27931 GB avail
>             3046506/676638611 objects misplaced (0.450%)
>                 2095 active+clean
>                   12 active+clean+scrubbing+deep
>                    5 active+remapped+backfilling
>   recovery io 2294 kB/s, 147 objects/s
>
> $ sudo rados df
> pool name         KB           objects    clones  degraded  unfound  rd        rd KB        wr         wr KB
> cephfs_data       6767569962   335746702  0       0         0        2136834   1            676984208  7052266742
> cephfs_metadata   42738        1058437    0       0         0        16130199  30718800215  295996938  3811963908
> rbd               0            0          0       0         0        0         0            0          0
> total used        19209068780  336805139
> total avail       10079469460
> total space       29288538240
>
> $ sudo ceph osd pool get cephfs_data pgp_num
> pg_num: 1024
> $ sudo ceph osd pool get cephfs_metadata pgp_num
> pg_num: 1024
>
> thanks,
> Bob
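
For reference, a rough sketch of the arithmetic above, using the same
assumed numbers (3x replication, 30 OSDs, ~80 IOPS per 7200rpm SATA disk;
the 80 IOPS figure is an estimate, not something measured on this cluster).
The last line is worth noting: if the backfill writes mostly land on a
single replacement disk, that part of the work does not parallelize across
the cluster, and the write side alone is on the order of days, which is
closer to what Bob is seeing.

$ objects=310000000; replicas=3; osds=30; iops=80
$ echo $(( objects * replicas / osds ))                   # object copies that lived on the failed disk (~31 million)
31000000
$ echo $(( objects * replicas / osds / (osds * iops) ))   # seconds to read them if spread across all 30 disks
12916
$ echo $(( objects * replicas / osds / iops ))            # seconds if one disk has to absorb all the writes (~4.5 days)
387500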
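
And, as a sketch only, one way to try Greg's "osd max push objects"
suggestion at runtime. The value 50 is an arbitrary example, not a tested
recommendation, and a change made with injectargs is lost when an OSD
restarts unless it is also added to ceph.conf.

$ sudo ceph tell osd.* injectargs '--osd-max-push-objects 50'

To make it persistent across restarts, in the [osd] section of ceph.conf on
the OSD hosts:

[osd]
osd max push objects = 50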