Bad performance in recovery

J-P Methot <jpmethot@xxxxxxxxxx> · Wed, 19 Aug 2015 15:16:39 -0400

Hi,

Our setup currently comprises 5 OSD nodes with 12 OSDs each, for a
total of 60 OSDs. All of the OSDs are SSDs, with 4 SSD journals on each
node. The Ceph version is Hammer v0.94.1. There is some performance
overhead with Ceph on SSDs (I've heard it gets better in Infernalis,
but we're not upgrading just yet), but we can reach numbers that I
would consider "alright".

Now, the issue is that when the cluster goes into recovery, it's very
fast at first but then slows down to ridiculous levels as it moves
forward. It can go from 7% to 2% left to recover in ten minutes, but it
may then take 2 hours to recover the last 2%. While this happens, the
attached OpenStack setup becomes incredibly slow, even though only a
small fraction of objects is still recovering (less than 1%). The
settings that may affect recovery speed are very low, as they are by
default, yet recovery still affects client I/O speed far more than it
should.
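
For reference, here is a minimal sketch of how the recovery throttles
could be lowered at runtime. It assumes the settings in question are
osd_max_backfills, osd_recovery_max_active and osd_recovery_op_priority
(the post does not name them), and the values shown are purely
illustrative. The helper name injectargs_command is hypothetical; it
only builds the `ceph tell ... injectargs` command line and does not
run it.

```python
import shlex

# Recovery-related OSD options. Which options and which values apply is an
# assumption for illustration, not the configuration of the cluster above.
THROTTLE = {
    "osd_max_backfills": 1,
    "osd_recovery_max_active": 1,
    "osd_recovery_op_priority": 1,
}


def injectargs_command(options, target="osd.*"):
    """Build the argv for `ceph tell <target> injectargs '<flags>'`."""
    flags = " ".join(
        "--{} {}".format(name.replace("_", "-"), value)
        for name, value in sorted(options.items())
    )
    return ["ceph", "tell", target, "injectargs", flags]


if __name__ == "__main__":
    # Print the command so it can be reviewed before being run by hand.
    print(" ".join(shlex.quote(part) for part in injectargs_command(THROTTLE)))
```

Values injected this way are not persistent; to survive an OSD restart
they would also need to go in the [osd] section of ceph.conf.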

Why would Ceph recovery become so slow as it progresses, and affect
client I/O even though it's recovering at a snail's pace? And by a
snail's pace, I mean a few kb/second on 10 Gbps uplinks.
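
To put a rough number on the slowdown, here is a small sketch that
compares the two recovery rates quoted above. It treats "ten minutes"
and "2 hours" as exact figures, which is an assumption; they are
ballpark numbers. The function name recovery_rate is just for
illustration.

```python
from fractions import Fraction


def recovery_rate(points_recovered, minutes):
    """Percentage points recovered per minute, as an exact fraction."""
    return Fraction(points_recovered, minutes)


# Early phase: from 7% down to 2% left to recover, in about ten minutes.
early = recovery_rate(7 - 2, 10)

# Late phase: the last 2%, in about two hours.
late = recovery_rate(2, 120)

# How many times slower the tail end is than the early phase.
slowdown = early / late

if __name__ == "__main__":
    print("early: {} points/min".format(float(early)))
    print("late:  {:.4f} points/min".format(float(late)))
    print("slowdown factor: {}".format(slowdown))
```

With those figures, the last 2% recovers about 30 times more slowly
than the first 5%.
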
-- 
======================
Jean-Philippe Méthot
Administrateur système / System administrator
GloboTech Communications
Phone: 1-514-907-0050
Toll Free: 1-(888)-GTCOMM1
Fax: 1-(514)-907-0750
jpmethot@xxxxxxxxxx
http://www.gtcomm.net
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com