Re: rebalancing taking very long time

On Wed, 9 Sep 2015, Vickey Singh wrote:
> Agreed with Alphe, Ceph Hammer (0.94.2) sucks when it comes to recovery and
> rebalancing.
> 
> Here is my Ceph Hammer cluster, which has been like this for more than 30 hours.
> 
> You might be thinking about that one OSD which is down and not in.  It's
> intentional, I want to remove that OSD.
> I want the cluster to become healthy again before I remove that OSD.
> 
> Can someone help us with this problem?
> 
>  cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
>      health HEALTH_WARN
>             14 pgs stuck unclean
>             5 requests are blocked > 32 sec
>             recovery 420/28358085 objects degraded (0.001%)
>             recovery 199941/28358085 objects misplaced (0.705%)
>             too few PGs per OSD (28 < min 30)
>      monmap e3: 3 mons at {stor0201=10.100.1.201:6789/0,stor0202=10.100.1.202:6789/0,stor0203=10.100.1.203:6789/0}
>             election epoch 1076, quorum 0,1,2 stor0201,stor0202,stor0203
>      osdmap e778879: 96 osds: 95 up, 95 in; 14 remapped pgs
>       pgmap v2475334: 896 pgs, 4 pools, 51364 GB data, 9231 kobjects
>             150 TB used, 193 TB / 344 TB avail
>             420/28358085 objects degraded (0.001%)
>             199941/28358085 objects misplaced (0.705%)
>                  879 active+clean

>                   14 active+remapped

                       ^^^

This is your problem.  It's not the recovery, it's that CRUSH is only 
mapping to 2 devices for one of your PGs.  This is usually a 
result of the vary_r tunable being 0.  Assuming all of your clients 
are firefly or newer, you can fix it with

 ceph osd crush tunables firefly
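
If you want to confirm that first, something like this should show it
(substitute one of the stuck PG ids for <pgid>):

 ceph osd crush show-tunables    # chooseleaf_vary_r is listed here
 ceph pg dump_stuck unclean      # lists the 14 stuck PGs
 ceph pg map <pgid>              # for a remapped PG, 'up' will likely show only 2 OSDs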

Alternatively, you can probably work around the situation by removing any 
'out' OSD from the crush map entirely, in which case

 ceph osd crush rm osd.<id>

will do the trick.
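
If the plan is to retire that OSD completely anyway, the usual removal
sequence (with <id> as the placeholder) is roughly:

 ceph osd crush rm osd.<id>    # drop it from the CRUSH map (as above)
 ceph auth del osd.<id>        # remove its cephx key
 ceph osd rm <id>              # remove it from the osdmap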

sage


>                    3 active+clean+scrubbing+deep
> 
> 
> 
> On Tue, Sep 8, 2015 at 5:59 PM, Alphe Salas <asalas@xxxxxxxxx> wrote:
>       I can say exactly the same.  I have been using Ceph since 0.38 and I have
>       never seen OSDs as laggy as with 0.94.  The rebalancing/rebuild algorithm
>       in 0.94 is crap.  Seriously, I have 2 OSDs serving 2 discs of 2 TB with
>       4 GB of RAM, and each OSD takes 1.6 GB!  Seriously!  That snowballs into
>       an avalanche.
> 
>       Let me be straight and explain what changed.
> 
>       In 0.38 you could ALWAYS stop the ceph cluster and then start it up; it
>       would check whether every OSD was back and whether there were enough
>       replicas, then start rebuilding/rebalancing what was needed.  Of course
>       it took about 10 minutes to bring the cluster up, but then the
>       rebuilding/rebalancing process was smooth.
>       With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63%, out
>       of 20 OSDs.  Then you get a disc crash, so Ceph automatically starts to
>       rebuild and rebalance, and the OSDs start to lag and then to crash.
>       You stop the cluster, change the drive, restart the cluster, stop all
>       rebuild activity by setting nobackfill, norecover, noscrub and
>       nodeep-scrub, rm the old OSD, create a new one, wait for all OSDs to be
>       in and up, and then the laggy rebuild/rebalance starts again; since it
>       is automated there is not much choice there.
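> 
>       (For reference, those flags are toggled with ceph osd set / unset; a
>       minimal sketch:)
> 
>        ceph osd set nobackfill      # likewise norecover, noscrub, nodeep-scrub
>        ceph osd unset nobackfill    # undo once the new OSD is back in and up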
> 
>       And again all the OSDs are stuck in an endless lag/down/recovery cycle...
> 
>       It is seriously a pain.  Five days after changing the faulty disc it is
>       still locked in the lag/down/recovery cycle.
> 
>       Sure, it can be argued that my machines are really resource limited and
>       that I should buy at least a three-thousand-dollar server.  But until
>       0.72 the rebalancing/rebuilding process was working smoothly on the same
>       hardware.
> 
>       It seems to me that the rebalancing/rebuilding algorithm is stricter now
>       than it was in the past; back then, only what really, really needed to
>       be rebuilt or rebalanced was rebuilt or rebalanced.
> 
>       I could still delete everything and go back to 0.72... as if I should
>       buy a Cray T-90 to never have problems again and have Ceph run smoothly.
>       But that will not help make Ceph a better product.
> 
>       For me, Ceph 0.94 is like Windows Vista...
> 
>       Alphe Salas
>       I.T. engineer
> 
>       On 09/08/2015 10:20 AM, Gregory Farnum wrote:
>             On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko
>             <bob@xxxxxxxxxxxx> wrote:
>                   When I lose a disk OR replace an OSD in my POC ceph cluster,
>                   it takes a very long time to rebalance.  I should note that
>                   my cluster is slightly unique in that I am using cephfs
>                   (shouldn't matter?) and it currently contains about 310
>                   million objects.
> 
>                   The last time I replaced a disk/OSD was 2.5 days ago and it
>                   is still rebalancing.  This is on a cluster with no client
>                   load.
> 
>                   The configuration is 5 hosts with 6 x 1TB 7200rpm SATA OSDs
>                   and one 850 Pro SSD which contains the journals for said
>                   OSDs.  That means 30 OSDs in total.  The system disk is on
>                   its own disk.  I'm also using a backend network with a
>                   single Gb NIC.  The rebalancing rate (objects/s) seems to be
>                   very slow when it is close to finishing... say <1% objects
>                   misplaced.
> 
>                   It doesn't seem right that it would take 2+ days to
>                   rebalance a 1TB disk with no load on the cluster.  Are my
>                   expectations off?
> 
> 
>             Possibly...Ceph basically needs to treat each object
>             as a single IO.
>             If you're recovering from a failed disk then you've
>             got to replicate
>             roughly 310 million * 3 / 30 = 31 million objects.
>             If it's perfectly
>             balanced across 30 disks that get 80 IOPS that's
>             12916 seconds (~3.5
>             hours) worth of work just to read each file, and in reality it's
>             reality it's
>             likely to take more than one IO to read the file,
>             and then you have to
>             spend a bunch to write it as well.
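> 
>             As a back-of-the-envelope check of that arithmetic (assuming ~80
>             IOPS per 7200rpm disk and all 30 OSDs recovering in parallel):
> 
>              echo $(( 310000000 * 3 / 30 / (30 * 80) ))   # prints 12916 (seconds)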
> 
> 
>                   I'm not sure if my pg_num/pgp_num needs to be changed OR the
>                   rebalance time is dependent on the number of objects in the
>                   pool.  These are thoughts I've had but am not certain are
>                   relevant here.
> 
> 
>             Rebalance time is dependent on the number of objects
>             in the pool. You
>             *might* see an improvement by increasing "osd max
>             push objects" from
>             its default of 10...or you might not. That many
>             small files isn't
>             something I've explored.
>             -Greg
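> 
>             A sketch of how that could be raised at runtime; the value 50 is
>             just an example, and it's worth reverting once backfill settles:
> 
>              ceph tell osd.* injectargs '--osd-max-push-objects 50'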
> 
> 
>                   $ sudo ceph -v
>                   ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
> 
>                   $ sudo ceph -s
>                   [sudo] password for bababurko:
>                        cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>                         health HEALTH_WARN
>                                5 pgs backfilling
>                                5 pgs stuck unclean
>                                recovery 3046506/676638611 objects misplaced (0.450%)
>                         monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>                                election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>                         mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>                         osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>                          pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
>                                18319 GB used, 9612 GB / 27931 GB avail
>                                3046506/676638611 objects misplaced (0.450%)
>                                    2095 active+clean
>                                      12 active+clean+scrubbing+deep
>                                       5 active+remapped+backfilling
>                     recovery io 2294 kB/s, 147 objects/s
> 
>                   $ sudo rados df
>                   pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
>                   cephfs_data       6767569962    335746702            0            0            0      2136834            1    676984208   7052266742
>                   cephfs_metadata        42738      1058437            0            0            0     16130199  30718800215    295996938   3811963908
>                   rbd                        0            0            0            0            0            0            0            0            0
>                     total used     19209068780    336805139
>                     total avail    10079469460
>                     total space    29288538240
> 
>                   $ sudo ceph osd pool get cephfs_data pgp_num
>                   pg_num: 1024
>                   $ sudo ceph osd pool get cephfs_metadata pgp_num
>                   pg_num: 1024
> 
> 
>                   thanks,
>                   Bob
> 
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
