On Fri, 2010-12-03 at 09:59 -0700, Gregory Farnum wrote:
> On Fri, Dec 3, 2010 at 8:48 AM, Jim Schutt <jaschut@xxxxxxxxxx> wrote:
> > I still see lots of clients resetting osds, but it has no
> > ill effects now.
> This at least is expected -- we realized a few months back that
> connections were never being removed from the OSD if the client
> crashed (didn't send a FIN notification) and had to implement
> timeouts. Having reasonably robust failure handling on each end meant
> we didn't need to do anything clever with keepalives, so we just left
> it. :)

Sure. I only mention it because it suggests that when the osds are
overloaded and causing the resets, a little extra work is being done
to handle them.

> > Separately,
> > This combination has survived the heaviest write loads
> > (64 clients against 13 osds) that I've tested with to date.
> How is this scaling for you on the client side? We're starting to do
> more large-scale testing but haven't gotten through much yet!

Well, I haven't been paying too much attention to performance yet, and
my disks are old and slow (40 MB/s streaming write to a raw block
device), so it doesn't take very many clients to saturate my osds.

However, I have noticed that aggregate throughput stays about the same
once I saturate the osds, no matter how much load I add after that.

With my disks, 13 osds (1/server right now), 2 GiB journal partition,
and replication level 2, I max out at 100-120 MB/s on streaming writes
from lots of clients.

I know that's not exactly what you were asking for, but it's all I've
got so far....

I'm about to start running with 16 osds/server. Stay tuned...

-- Jim

> -Greg

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
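
[A back-of-envelope sketch of why the 100-120 MB/s ceiling above is plausible. It assumes the journal partition sits on the same disk as the data, so each replica write hits the disk twice (journal + data); that journal factor is my assumption, not something stated in the thread.]

```python
# Rough write-throughput ceiling for the setup described in the email.
# Figures from the thread: 13 osds, 40 MB/s disks, replication level 2.
# Assumption (mine): journal shares the data disk, so every client byte
# is written twice per replica -> 2 * 2 = 4 disk writes per client byte.

DISK_MBPS = 40        # streaming write speed of one disk (from the email)
NUM_OSDS = 13         # one osd per server (from the email)
REPLICATION = 2       # replication level (from the email)
JOURNAL_FACTOR = 2    # journal write + data write per replica (assumption)

raw_bandwidth = DISK_MBPS * NUM_OSDS                 # total raw disk bandwidth
write_amplification = REPLICATION * JOURNAL_FACTOR   # disk writes per client byte
ceiling = raw_bandwidth / write_amplification        # aggregate client write rate

print(f"raw disk bandwidth: {raw_bandwidth} MB/s")
print(f"estimated client write ceiling: {ceiling:.0f} MB/s")
```

Under those assumptions the estimate comes out around 130 MB/s, which lines up reasonably with the 100-120 MB/s observed once some protocol and seek overhead is allowed for.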