Re: rebalancing taking very long time

Agreed with Alphe: Ceph Hammer (0.94.2) sucks when it comes to recovery and rebalancing.

Here is my Ceph Hammer cluster, which has been in this state for more than 30 hours.

You might be wondering about the one OSD that is down and not in. That is intentional: I want to remove that OSD, but I want the cluster to become healthy again before I do.
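
Once it is healthy, the removal itself should just be the usual manual sequence, roughly (with osd.X standing in for the OSD in question):

ceph osd out X               # already down and out in my case
ceph osd crush remove osd.X
ceph auth del osd.X
ceph osd rm X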

Can someone help us with this problem?

 cluster 86edf8b8-b353-49f1-ab0a-a4827a9ea5e8
     health HEALTH_WARN
            14 pgs stuck unclean
            5 requests are blocked > 32 sec
            recovery 420/28358085 objects degraded (0.001%)
            recovery 199941/28358085 objects misplaced (0.705%)
            too few PGs per OSD (28 < min 30)
     monmap e3: 3 mons at {stor0201=10.100.1.201:6789/0,stor0202=10.100.1.202:6789/0,stor0203=10.100.1.203:6789/0}
            election epoch 1076, quorum 0,1,2 stor0201,stor0202,stor0203
     osdmap e778879: 96 osds: 95 up, 95 in; 14 remapped pgs
      pgmap v2475334: 896 pgs, 4 pools, 51364 GB data, 9231 kobjects
            150 TB used, 193 TB / 344 TB avail
            420/28358085 objects degraded (0.001%)
            199941/28358085 objects misplaced (0.705%)
                 879 active+clean
                  14 active+remapped
                   3 active+clean+scrubbing+deep



On Tue, Sep 8, 2015 at 5:59 PM, Alphe Salas <asalas@xxxxxxxxx> wrote:
I can say exactly the same. I have been using Ceph since 0.38 and I have never seen OSDs as laggy as with 0.94. The rebalancing/rebuild algorithm in 0.94 is crap, seriously. I have 2 OSDs serving 2 discs of 2 TB with 4 GB of RAM, and each OSD takes 1.6 GB!!! Seriously! That snowballs into an avalanche.

Let me be straight and explain what changed.

In 0.38 you could ALWAYS stop the Ceph cluster and then start it up again; it would check whether everyone was back and whether there were enough replicas, and then start rebuilding/rebalancing what was needed. Of course it took about 10 minutes to bring the cluster up, but after that the rebuilding/rebalancing process was smooth.
With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63%, out of 20 OSDs. Then you get a disc crash, so Ceph automatically starts to rebuild and rebalance things, and the OSDs start to lag and then to crash. You stop the Ceph cluster, change the drive, restart it, stop all rebuild activity by setting the nobackfill, norecover, noscrub and nodeep-scrub flags (commands below), rm the old OSD, create a new one, wait for all OSDs to be up and in, and then the rebuilding/rebalancing (and the lag) starts again. Since it is automated, there is not much choice there.
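
For the record, the flags I am talking about are set and cleared like this:

ceph osd set nobackfill
ceph osd set norecover
ceph osd set noscrub
ceph osd set nodeep-scrub
# ... swap the disc, recreate the OSD, then let recovery run:
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset noscrub
ceph osd unset nodeep-scrub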

And again all OSDs are stuck in an endless lag/down/recovery cycle...

It is seriously a pain. Five days after changing the faulty disc, the cluster is still locked in the lag/down/recovery cycle.

Sure, it can be argued that my machines are really resource limited and that I should buy servers worth at least three thousand dollars. But until 0.72 the rebalancing/rebuilding process was working smoothly on the same hardware.

It seems to me that the rebalancing/rebuilding algorithm is stricter now than it was in the past. Back then, only what really, really needed to be rebuilt or rebalanced was touched.

I could still delete everything and go back to 0.72... just as I could buy a Cray T-90 so I never have any more problems and Ceph runs smoothly. But that will not help make Ceph a better product.

For me, Ceph 0.94 is like Windows Vista...

Alphe Salas
I.T. engineer


On 09/08/2015 10:20 AM, Gregory Farnum wrote:
On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
When I lose a disk or replace an OSD in my POC Ceph cluster, it takes a very
long time to rebalance. I should note that my cluster is slightly unique in
that I am using CephFS (shouldn't matter?) and it currently contains about
310 million objects.

The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing.  This is on a cluster with no client load.

The configuration is 5 hosts, each with 6 x 1TB 7200rpm SATA OSDs plus one
850 Pro SSD that holds the journals for those OSDs. That means 30 OSDs in
total. The system disk is on its own drive. I'm also using a backend network
with a single Gb NIC. The rebalancing rate (objects/s) seems to be very slow
when it is close to finishing... say <1% objects misplaced.

It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster.  Are my expectations off?

Possibly...Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
hours) worth of work just to read each file — and in reality it's
likely to take more than one IO to read the file, and then you have to
spend a bunch to write it as well.
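
Spelled out, that back-of-the-envelope estimate is just:

  310,000,000 objects * 3 copies / 30 OSDs ~= 31,000,000 object copies to re-replicate
  31,000,000 reads / (30 disks * 80 IOPS)  ~= 12,916 seconds, i.e. about 3.5 hours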


I'm not sure whether my pg_num/pgp_num needs to be changed or whether the
rebalance time is simply dependent on the number of objects in the pool.
These are thoughts I've had but am not certain are relevant here.

Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
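
If you want to try it, something along these lines should do it (the value 64
here is just an arbitrary example, not a recommendation):

ceph tell osd.* injectargs '--osd_max_push_objects 64'

or, to make it persistent, in ceph.conf on the OSD hosts:

[osd]
osd max push objects = 64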
-Greg


$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)

$ sudo ceph -s
[sudo] password for bababurko:
     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
      health HEALTH_WARN
             5 pgs backfilling
             5 pgs stuck unclean
             recovery 3046506/676638611 objects misplaced (0.450%)
      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
             election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
      mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
      osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
       pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
             18319 GB used, 9612 GB / 27931 GB avail
             3046506/676638611 objects misplaced (0.450%)
                 2095 active+clean
                   12 active+clean+scrubbing+deep
                    5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s

$ sudo rados df
pool name                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
cephfs_data       6767569962    335746702            0            0            0      2136834            1    676984208   7052266742
cephfs_metadata        42738      1058437            0            0            0     16130199  30718800215    295996938   3811963908
rbd                        0            0            0            0            0            0            0            0            0
   total used     19209068780    336805139
   total avail    10079469460
   total space    29288538240

$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024


thanks,
Bob

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
