Re: rebalancing taking very long time

I can say exactly the same: I have been using Ceph since 0.38 and I have never seen OSDs as laggy as with 0.94. The rebalancing/rebuild behaviour in 0.94 is seriously bad. I have 2 OSDs serving two 2 TB disks on a node with 4 GB of RAM, and each OSD takes 1.6 GB of it. Seriously! That snowballs into an avalanche.

Let me be straight and explain what changed.

In 0.38 you could ALWAYS stop the Ceph cluster and start it again; it would check that everyone was back and that there were enough replicas, and only then start rebuilding/rebalancing what actually needed it. Sure, it took about 10 minutes to bring the cluster up, but after that the rebuild/rebalance process was smooth.

With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63%, out of 20 OSDs. Then a disk crashes, so Ceph automatically starts to rebuild and rebalance, and the OSDs begin to lag and then crash. So you stop the cluster, change the drive, restart the cluster, stop all rebuild activity by setting the nobackfill, norecover, noscrub and nodeep-scrub flags (sketched below), remove the old OSD, create a new one, wait for all OSDs to be up and in... and then the automated rebuild/rebalance (and the lag) starts all over again; since it is automated, there is not much choice.
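
For reference, the flags involved are standard cluster flags; the commands look roughly like this (quoted from memory, so double-check before copy-pasting):

$ sudo ceph osd set nobackfill     # stop backfill
$ sudo ceph osd set norecover      # stop recovery
$ sudo ceph osd set noscrub        # stop regular scrubbing
$ sudo ceph osd set nodeep-scrub   # stop deep scrubbing

Once the replacement OSD is up and in, the matching "ceph osd unset ..." commands clear them again.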

And again all OSDs are stuck in an endless lag/down/recovery cycle...

It is seriously painful. Five days after changing the faulty disk, the cluster is still locked in that lag/down/recovery cycle.

Sure, it can be argued that my machines are really resource-limited and that I should buy servers worth at least three thousand dollars each. But up to 0.72 the rebalancing/rebuilding process worked smoothly on the same hardware.
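
I am aware that recovery can be throttled with the standard options, roughly like the example below (the values are only a guess at a gentle setting, not something I have validated), but my point is that older releases did not need any of this on the same hardware:

[osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1

or at runtime, without restarting the OSDs:

$ sudo ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'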

It seems to me that the rebalancing/rebuilding algorithm is stricter now than it was in the past: back then, only what really needed to be rebuilt or rebalanced was touched.

I could still delete everything and go back to 0.72... or buy a Cray T-90 so that I never have problems again and Ceph runs smoothly. But that would not help make Ceph a better product.

For me, Ceph 0.94 is like Windows Vista...

Alphe Salas
IT engineer

On 09/08/2015 10:20 AM, Gregory Farnum wrote:
On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
When I lose a disk or replace an OSD in my POC Ceph cluster, it takes a very
long time to rebalance.  I should note that my cluster is slightly unique in
that I am using CephFS (shouldn't matter?) and it currently contains about
310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing.  This is on a cluster with no client load.

The configuration is 5 hosts with 6 x 1 TB 7200 rpm SATA OSDs and one 850 Pro
SSD which holds the journals for those OSDs.  That means 30 OSDs in
total.  The system disk is a separate device.  I'm also using a backend network
with a single Gb NIC.  The rebalancing rate (objects/s) seems to be very slow
when it is close to finishing... say <1% of objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster.  Are my expectations off?

Possibly...Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
hours) worth of work just to read each file — and in reality it's
likely to take more than one IO to read the file, and then you have to
spend a bunch to write it as well.
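
Spelling that arithmetic out as a quick sanity check (the 80 IOPS per 7200 rpm spindle is an assumption, not a measurement):

$ echo $(( 310000000 * 3 / 30 ))   # objects to re-replicate: 31000000
$ echo $(( 30 * 80 ))              # best-case aggregate read IOPS: 2400
$ echo $(( 31000000 / 2400 ))      # seconds for one read pass: 12916 (~3.5 hours)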


I'm not sure if my pg_num/pgp_num needs to be changed, or if the rebalance time
is dependent on the number of objects in the pool.  These are thoughts I've
had but am not certain are relevant here.

Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
-Greg
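
For anyone who wants to try that, changing the value on a running cluster would look roughly like this (the option name comes from the reply above; 50 is an arbitrary example value whose effect I have not measured):

$ sudo ceph tell osd.* injectargs '--osd-max-push-objects 50'

or persistently, in the [osd] section of ceph.conf:

[osd]
    osd max push objects = 50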


$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
      health HEALTH_WARN
             5 pgs backfilling
             5 pgs stuck unclean
             recovery 3046506/676638611 objects misplaced (0.450%)
      monmap e1: 3 mons at
{cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
             election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
      mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
      osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
       pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
             18319 GB used, 9612 GB / 27931 GB avail
             3046506/676638611 objects misplaced (0.450%)
                 2095 active+clean
                   12 active+clean+scrubbing+deep
                    5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name              KB      objects    clones  degraded  unfound        rd        rd KB         wr       wr KB
cephfs_data      6767569962  335746702         0         0        0   2136834            1  676984208  7052266742
cephfs_metadata       42738    1058437         0         0        0  16130199  30718800215  295996938  3811963908
rbd                       0          0         0         0        0         0            0          0           0
   total used     19209068780    336805139
   total avail    10079469460
   total space    29288538240

$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024


thanks,
Bob

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




