Hi, On 07/19/13 07:16, Dan van der Ster wrote:
and that gives me something like this:

2013-07-18 21:22:56.546094 mon.0 128.142.142.156:6789/0 27984 : [INF] pgmap v112308: 9464 pgs: 8129 active+clean, 398 active+remapped+wait_backfill, 3 active+recovery_wait, 933 active+remapped+backfilling, 1 active+clean+scrubbing; 15994 GB data, 55567 GB used, 1380 TB / 1434 TB avail; 11982626/151538728 degraded (7.907%); recovering 299 o/s, 114MB/s

but immediately I start to see slow requests piling up. Trying different combinations, I found that it's the "max active = 10" setting that leads to the slow requests. With a 20/5 setting there are no slow requests, but the recovery rate doesn't increase either.

So I'm wondering: do you all agree that this indicates the 10/5 setting for backfill/max active is already the limit for our cluster, at least with the current set of test objects we have? Or am I missing another option that should be tweaked to get more recovery throughput?
This mostly looks like a 1Gb ethernet cap (114MB/s is 912Mb/s); it's what I get with my small two-node, six-drive (SSD journals) cluster with a 1Gb/s cluster link, so you should get more out of a 10Gb/s network. When I get more than that, it's because of host-to-host moves; when I get less, it's because of client load.
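To make that arithmetic explicit, here is a quick sanity check (just illustrative Python; the link speeds are nominal line rates and ignore protocol overhead):

# Compare the reported recovery rate with the network line rate.
recovery_mb_per_s = 114.0                 # MB/s from the pgmap line above
recovery_mbit = recovery_mb_per_s * 8     # = 912 Mbit/s

for link_gbit in (1, 10):
    link_mbit = link_gbit * 1000.0
    print("%2d GbE: recovery uses %.0f%% of the link"
          % (link_gbit, 100.0 * recovery_mbit / link_mbit))

That prints roughly 91% for a 1Gb/s link and 9% for a 10Gb/s link, so on your 10Gb/s network the 114MB/s you see is nowhere near the wire and the cap is presumably somewhere else.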