Hi, On 07/19/13 07:16, Dan van der Ster wrote:
and that gives me something like this:

2013-07-18 21:22:56.546094 mon.0 128.142.142.156:6789/0 27984 : [INF] pgmap v112308: 9464 pgs: 8129 active+clean, 398 active+remapped+wait_backfill, 3 active+recovery_wait, 933 active+remapped+backfilling, 1 active+clean+scrubbing; 15994 GB data, 55567 GB used, 1380 TB / 1434 TB avail; 11982626/151538728 degraded (7.907%); recovering 299 o/s, 114MB/s

but immediately I start to see slow requests piling up. Trying different combinations, I found that it's the "max active = 10" setting that leads to the slow requests. With a 20/5 setting there are no slow requests, but the recovery rate doesn't increase either.

So I'm wondering: do you all agree that this indicates the 10/5 setting for backfill/max active is already the limit for our cluster, at least with the current set of test objects we have? Or am I missing another option that should be tweaked to get more recovery throughput?
This mostly looks like a 1Gb ethernet cap (114MB/s is 912Mb/s); it's what I get with my small two-node, six-drive (SSD journals) cluster with a 1Gb/s cluster link, so you should get more out of a 10Gb/s network. When I get more than that, it's because of host-to-host moves; when I get less, it's because of client load.
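To make that arithmetic explicit, here is a quick sanity check (just illustrative Python; the link speeds are nominal line rates and ignore protocol overhead):

# Compare the reported recovery rate with the network line rate.
recovery_mb_per_s = 114.0                 # MB/s from the pgmap line above
recovery_mbit = recovery_mb_per_s * 8     # = 912 Mbit/s

for link_gbit in (1, 10):
    link_mbit = link_gbit * 1000.0
    print("%2d GbE: recovery uses %.0f%% of the link"
          % (link_gbit, 100.0 * recovery_mbit / link_mbit))

That prints roughly 91% for a 1Gb/s link and 9% for a 10Gb/s link, so on your 10Gb/s network the 114MB/s you see is nowhere near the wire and the cap is presumably somewhere else.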