Re: rebalancing taking very long time

Agreed with Alphe: Ceph Hammer (0.94.2) sucks when it comes to recovery and rebalancing.

Here is my Ceph Hammer cluster, which has been in this state for more than 30 hours.

You might be wondering about the one OSD that is down and not in. That is intentional: I want to remove that OSD, but I want the cluster to become healthy again before I do.
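
Once it is healthy, the removal itself should just be the usual manual sequence, roughly (with osd.X standing in for the OSD in question):

ceph osd out X               # already down and out in my case
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm X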

Can someone help us with this problem?

 cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            14 pgs stuck unclean
            5 requests are blocked > 32 sec
            recovery 420/28358085 objects degraded (0.001%)
            recovery 199941/28358085 objects misplaced (0.705%)
            too few PGs per OSD (28 < min 30)
     monmap e3: 3 mons at {stor0201=10.100.1.201:6789/0,stor0202=10.100.1.202:6789/0,stor0203=10.100.1.203:6789/0}
            election epoch 1076, quorum 0,1,2 stor0201,stor0202,stor0203
     osdmap e778879: 96 osds: 95 up, 95 in; 14 remapped pgs
      pgmap v2475334: 896 pgs, 4 pools, 51364 GB data, 9231 kobjects
            150 TB used, 193 TB / 344 TB avail
            420/28358085 objects degraded (0.001%)
            199941/28358085 objects misplaced (0.705%)
                 879 active+clean
                  14 active+remapped
                   3 active+clean+scrubbing+deep



On Tue, Sep 8, 2015 at 5:59 PM, Alphe Salas <asalas@xxxxxxxxx> wrote:
I can say exactly the same. I have been using Ceph since 0.38 and I have never seen OSDs as laggy as with 0.94. The rebalancing/rebuild algorithm in 0.94 is crap, seriously. I have 2 OSDs serving 2 discs of 2 TB with 4 GB of RAM, and each OSD takes 1.6 GB!!! Seriously! That snowballs into an avalanche.

Let me be straight and explain what changed.

In 0.38 you could ALWAYS stop the Ceph cluster and then start it up again; it would check whether everyone was back and whether there were enough replicas, and then start rebuilding/rebalancing what was needed. Of course it took about 10 minutes to bring the cluster up, but after that the rebuilding/rebalancing process was smooth.
With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63%, out of 20 OSDs. Then you get a disc crash, so Ceph automatically starts to rebuild and rebalance things, and the OSDs start to lag and then to crash. You stop the Ceph cluster, change the drive, restart it, stop all rebuild activity by setting the nobackfill, norecover, noscrub and nodeep-scrub flags (commands below), rm the old OSD, create a new one, wait for all OSDs to be up and in, and then the rebuilding/rebalancing (and the lag) starts again. Since it is automated, there is not much choice there.
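
For the record, the flags I am talking about are set and cleared like this:

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... swap the disc, recreate the OSD, then let recovery run:
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset noscrub
ceph osd unset nodeep-scrub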

And again all OSDs are stuck in an endless lag/down/recovery cycle...

It is seriously a pain. Five days after changing the faulty disc, the cluster is still locked in the lag/down/recovery cycle.

Sure, it can be argued that my machines are really resource limited and that I should buy servers worth at least three thousand dollars. But until 0.72 the rebalancing/rebuilding process was working smoothly on the same hardware.

It seems to me that the rebalancing/rebuilding algorithm is stricter now than it was in the past. Back then, only what really, really needed to be rebuilt or rebalanced was touched.

I could still delete everything and go back to 0.72... just as I could buy a Cray T-90 so I never have any more problems and Ceph runs smoothly. But that will not help make Ceph a better product.

For me, Ceph 0.94 is like Windows Vista...

Alphe Salas
I.T. engineer


On 09/08/2015 10:20 AM, Gregory Farnum wrote:
On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
When I lose a disk or replace an OSD in my POC Ceph cluster, it takes a very
long time to rebalance. I should note that my cluster is slightly unique in
that I am using CephFS (shouldn't matter?) and it currently contains about
310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing.  This is on a cluster with no client load.

The configuration is 5 hosts, each with 6 x 1TB 7200rpm SATA OSDs plus one
850 Pro SSD that holds the journals for those OSDs. That means 30 OSDs in
total. The system disk is on its own drive. I'm also using a backend network
with a single Gb NIC. The rebalancing rate (objects/s) seems to be very slow
when it is close to finishing... say <1% objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster.  Are my expectations off?

Possibly...Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
hours) worth of work just to read each file — and in reality it's
likely to take more than one IO to read the file, and then you have to
spend a bunch to write it as well.
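
Spelled out, that back-of-the-envelope estimate is just:

  310,000,000 objects * 3 copies / 30 OSDs ~= 31,000,000 object copies to re-replicate
  31,000,000 reads / (30 disks * 80 IOPS)  ~= 12,916 seconds, i.e. about 3.5 hours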


I'm not sure whether my pg_num/pgp_num needs to be changed or whether the
rebalance time is simply dependent on the number of objects in the pool.
These are thoughts I've had but am not certain are relevant here.

Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
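
If you want to try it, something along these lines should do it (the value 64
here is just an arbitrary example, not a recommendation):

ceph tell osd.* injectargs '--osd_max_push_objects 64'

or, to make it persistent, in ceph.conf on the OSD hosts:

[osd]
osd max push objects = 64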
-Greg


$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
[sudo] password for bababurko:
     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
      health HEALTH_WARN
             5 pgs backfilling
             5 pgs stuck unclean
             recovery 3046506/676638611 objects misplaced (0.450%)
      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
             election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
      mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
      osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
       pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
             18319 GB used, 9612 GB / 27931 GB avail
             3046506/676638611 objects misplaced (0.450%)
                 2095 active+clean
                   12 active+clean+scrubbing+deep
                    5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
cephfs_data       6767569962    335746702            0            0            0      2136834            1    676984208   7052266742
cephfs_metadata        42738      1058437            0            0            0     16130199  30718800215    295996938   3811963908
rbd                        0            0            0            0            0            0            0            0            0
   total used     19209068780    336805139
   total avail    10079469460
   total space    29288538240

$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024


thanks,
Bob

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
