On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
> When I lose a disk OR replace an OSD in my POC ceph cluster, it takes a
> very long time to rebalance. I should note that my cluster is slightly
> unique in that I am using cephfs (shouldn't matter?) and it currently
> contains about 310 million objects.
>
> The last time I replaced a disk/OSD was 2.5 days ago and it is still
> rebalancing. This is on a cluster with no client load.
>
> The configuration is 5 hosts, each with 6 x 1TB 7200rpm SATA OSDs and one
> 850 Pro SSD holding the journals for those OSDs. That means 30 OSDs in
> total. The system disk is on its own disk. I'm also using a backend
> network with a single Gb NIC. The rebalancing rate (objects/s) seems to
> be very slow when it is close to finishing... say <1% objects misplaced.
>
> It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
> with no load on the cluster. Are my expectations off?

Possibly... Ceph basically needs to treat each object as a single IO. If
you're recovering from a failed disk then you've got to replicate roughly
310 million * 3 / 30 = 31 million object copies (the share that lived on
that disk). Even if that work is perfectly balanced across 30 disks that
each do about 80 IOPS, that's 31 million / 2400 = ~12,900 seconds (~3.5
hours) of work just to read each file. In reality it's likely to take more
than one IO to read each file, and then you have to spend a bunch of IO to
write it all back out as well.

> I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance
> time is dependent on the number of objects in the pool. These are
> thoughts I've had but am not certain are relevant here.

Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from its
default of 10... or you might not. That many small files isn't something
I've explored.
-Greg

> $ sudo ceph -v
> ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
>
> $ sudo ceph -s
> [sudo] password for bababurko:
>     cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
>      health HEALTH_WARN
>             5 pgs backfilling
>             5 pgs stuck unclean
>             recovery 3046506/676638611 objects misplaced (0.450%)
>      monmap e1: 3 mons at {cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
>             election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
>      mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
>      osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
>       pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
>             18319 GB used, 9612 GB / 27931 GB avail
>             3046506/676638611 objects misplaced (0.450%)
>                 2095 active+clean
>                   12 active+clean+scrubbing+deep
>                    5 active+remapped+backfilling
>   recovery io 2294 kB/s, 147 objects/s
>
> $ sudo rados df
> pool name         KB           objects    clones  degraded  unfound  rd        rd KB        wr         wr KB
> cephfs_data       6767569962   335746702  0       0         0        2136834   1            676984208  7052266742
> cephfs_metadata   42738        1058437    0       0         0        16130199  30718800215  295996938  3811963908
> rbd               0            0          0       0         0        0         0            0          0
> total used        19209068780  336805139
> total avail       10079469460
> total space       29288538240
>
> $ sudo ceph osd pool get cephfs_data pgp_num
> pg_num: 1024
> $ sudo ceph osd pool get cephfs_metadata pgp_num
> pg_num: 1024
>
> thanks,
> Bob
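
For reference, a rough sketch of the arithmetic above, using the same
assumed numbers (3x replication, 30 OSDs, ~80 IOPS per 7200rpm SATA disk;
the 80 IOPS figure is an estimate, not something measured on this cluster).
The last line is worth noting: if the backfill writes mostly land on a
single replacement disk, that part of the work does not parallelize across
the cluster, and the write side alone is on the order of days, which is
closer to what Bob is seeing.

$ objects=310000000; replicas=3; osds=30; iops=80
$ echo $(( objects * replicas / osds ))                   # object copies that lived on the failed disk (~31 million)
31000000
$ echo $(( objects * replicas / osds / (osds * iops) ))   # seconds to read them if spread across all 30 disks
12916
$ echo $(( objects * replicas / osds / iops ))            # seconds if one disk has to absorb all the writes (~4.5 days)
387500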
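
And, as a sketch only, one way to try Greg's "osd max push objects"
suggestion at runtime. The value 50 is an arbitrary example, not a tested
recommendation, and a change made with injectargs is lost when an OSD
restarts unless it is also added to ceph.conf.

$ sudo ceph tell osd.* injectargs '--osd-max-push-objects 50'

To make it persistent across restarts, in the [osd] section of ceph.conf on
the OSD hosts:

[osd]
osd max push objects = 50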