I can say exactly the same. I have been using ceph since 0.38 and I have never
seen OSDs as laggy as with 0.94. The rebalancing/rebuild behaviour in 0.94 is
seriously bad. I have 2 OSDs serving 2 disks of 2 TB each on 4 GB of RAM, and
each OSD process takes 1.6 GB! Seriously, that snowballs into an avalanche.
Let me be straight and explain what changed.
In 0.38 you could ALWAYS stop the ceph cluster and start it up again; it would
check whether every OSD was back and whether there were enough replicas, and
only then start rebuilding/rebalancing what actually needed it. Of course it
took something like 10 minutes to bring the cluster up, but after that the
rebuilding/rebalancing process was smooth.
With 0.94, first you have 2 OSDs too full at 95% and 4 OSDs at 63% out of 20
OSDs. Then a disk crashes, so ceph automatically starts to rebuild and
rebalance, and that is when OSDs start to lag and then crash.
You stop the ceph cluster, swap the drive, restart the cluster, stop all
rebuild activity by setting the nobackfill, norecover, noscrub and
nodeep-scrub flags, remove the old OSD, create a new one, wait for all OSDs
to be in and up, and then the rebuilding/rebalancing (and the lag) starts
again. Since it is automated, there is not much choice there (the commands I
mean are sketched just below).
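For the record, this is roughly the sequence I am talking about. osd.N is a
placeholder for the id of the failed OSD, and the exact disk preparation step
depends on how the OSD was deployed:

$ ceph osd set nobackfill
$ ceph osd set norecover
$ ceph osd set noscrub
$ ceph osd set nodeep-scrub
$ ceph osd out osd.N              # mark the dead OSD out
$ ceph osd crush remove osd.N     # drop it from the CRUSH map
$ ceph auth del osd.N             # remove its authentication key
$ ceph osd rm osd.N               # remove the OSD entry itself
# ... prepare and activate the replacement disk, wait for it to be up and in ...
$ ceph osd unset nobackfill
$ ceph osd unset norecover
$ ceph osd unset noscrub
$ ceph osd unset nodeep-scrub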
And again all the OSDs are stuck in an endless lag/down/recovery cycle...
It is seriously a pain. Five days after changing the faulty disk the cluster
is still locked in that lag/down/recovery cycle.
Sure, it can be argued that my machines are really resource limited and that
I should buy servers worth at least three thousand dollars each. But up to
0.72 the rebalancing/rebuilding process worked smoothly on the same hardware.
It seems to me that the rebalancing/rebuilding algorithm is stricter now than
it was in the past. Back then, only what really, really needed to be rebuilt
or rebalanced actually was.
I could still delete everything and go back to 0.72... just as I could buy a
Cray T-90 so I never have problems again and ceph runs smoothly. But that
will not help make ceph a better product.
For me, ceph 0.94 is like Windows Vista...
Alphe Salas
IT engineer
On 09/08/2015 10:20 AM, Gregory Farnum wrote:
On Wed, Sep 2, 2015 at 9:34 PM, Bob Ababurko <bob@xxxxxxxxxxxx> wrote:
When I lose a disk OR replace an OSD in my POC ceph cluster, it takes a very
long time to rebalance. I should note that my cluster is slightly unique in
that I am using cephfs (shouldn't matter?) and it currently contains about
310 million objects.
The last time I replaced a disk/OSD was 2.5 days ago and it is still
rebalancing. This is on a cluster with no client load.
The configuration is 5 hosts, each with 6 x 1TB 7200rpm SATA OSDs and one
850 Pro SSD that holds the journals for those OSDs. That means 30 OSDs in
total. The system disk is separate, and the backend network runs over a
single Gb NIC. The rebalancing rate (objects/s) seems to be very slow when it
is close to finishing... say <1% of objects misplaced.
It doesn't seem right that it would take 2+ days to rebalance a 1TB disk
with no load on the cluster. Are my expectations off?
Possibly...Ceph basically needs to treat each object as a single IO.
If you're recovering from a failed disk then you've got to replicate
roughly 310 million * 3 / 30 = 31 million objects. If it's perfectly
balanced across 30 disks that get 80 IOPS that's 12916 seconds (~3.5
hours) worth of work just to read each file — and in reality it's
likely to take more than one IO to read the file, and then you have to
spend a bunch to write it as well.
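To spell that arithmetic out (same round numbers, roughly 80 IOPS per spindle
and one IO per object, so treat it as an optimistic lower bound):

$ echo $(( 310000000 * 3 / 30 ))       # copies that lived on the failed disk
31000000
$ echo $(( 31000000 / (30 * 80) ))     # seconds, if all 30 disks recover in parallel at ~80 IOPS
12916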
I'm not sure if my pg_num/pgp_num needs to be changed OR the rebalance time
is dependent on the number of objects in the pool. These are thoughts I've
had but am not certain are relevant here.
Rebalance time is dependent on the number of objects in the pool. You
*might* see an improvement by increasing "osd max push objects" from
its default of 10...or you might not. That many small files isn't
something I've explored.
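If you want to try it, something along these lines should work at runtime
(64 is only an arbitrary example, not a recommendation):

$ ceph tell 'osd.*' injectargs '--osd_max_push_objects 64'

or persistently in ceph.conf, under the [osd] section:

osd max push objects = 64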
-Greg
$ sudo ceph -v
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
$ sudo ceph -s
cluster f25cb23f-2293-4682-bad2-4b0d8ad10e79
health HEALTH_WARN
5 pgs backfilling
5 pgs stuck unclean
recovery 3046506/676638611 objects misplaced (0.450%)
monmap e1: 3 mons at
{cephmon01=10.15.24.71:6789/0,cephmon02=10.15.24.80:6789/0,cephmon03=10.15.24.135:6789/0}
election epoch 20, quorum 0,1,2 cephmon01,cephmon02,cephmon03
mdsmap e6070: 1/1/1 up {0=cephmds01=up:active}, 1 up:standby
osdmap e4395: 30 osds: 30 up, 30 in; 5 remapped pgs
pgmap v3100039: 2112 pgs, 3 pools, 6454 GB data, 321 Mobjects
18319 GB used, 9612 GB / 27931 GB avail
3046506/676638611 objects misplaced (0.450%)
2095 active+clean
12 active+clean+scrubbing+deep
5 active+remapped+backfilling
recovery io 2294 kB/s, 147 objects/s
$ sudo rados df
pool name            KB           objects    clones  degraded  unfound  rd        rd KB        wr         wr KB
cephfs_data          6767569962   335746702  0       0         0        2136834   1            676984208  7052266742
cephfs_metadata      42738        1058437    0       0         0        16130199  30718800215  295996938  3811963908
rbd                  0            0          0       0         0        0         0            0          0
  total used         19209068780  336805139
  total avail        10079469460
  total space        29288538240
$ sudo ceph osd pool get cephfs_data pgp_num
pg_num: 1024
$ sudo ceph osd pool get cephfs_metadata pgp_num
pg_num: 1024
thanks,
Bob
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com