Christian Schnidrig writes:
> Well that’s strange. I wonder why our systems behave so differently.

One point about our cluster (I work with Christian, who's still on
vacation, and Jens-Christian) is that it has 124 OSDs and 2048 PGs (I
think) in the pool used for these RBD volumes. As a result, each
attached RBD volume can result in up to 124 (or slightly fewer)
connections from the RBD client inside Qemu/KVM, one to each OSD that
stores data from that volume.

I don't know how librbd's connection management works. I assume that
these librbd-to-OSD connections are only created once the client
actually tries to access data on that OSD. But when the VM actually
accesses a lot of data on its RBD volumes (which ours do), then all of
these connections will in fact be created. And apparently librbd
doesn't handle it very gracefully when its process runs into the open
file descriptor limit.

George only has 20 OSDs, so I guess that's an upper bound on the number
of TCP connections that librbd will open per RBD volume. He should be
safe up to about 50 volumes per VM, assuming the default open-files
limit of 1024.

The nasty thing is when everything has been running fine for ages, then
you add a bunch of OSDs, run a few benchmarks, see that everything
should run much BETTER (as promised :-), but then suddenly some VMs
with lots of mounted volumes mysteriously start hanging.

> Maybe the number of placement groups plays a major role as well.
> Jens-Christian may be able to give you the specifics of our ceph
> cluster.

So can I; see above.

> I’m about to leave on vacation and don’t have time to look that up
> anymore.

Enjoy your well-earned vacation!!

--
Simon.
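
P.S. For what it's worth, here is the back-of-envelope arithmetic
behind the "about 50 volumes" figure as a small Python sketch. It
assumes, as speculated above, roughly one TCP connection (and hence one
file descriptor) per attached volume per OSD that the volume touches,
and it ignores the other descriptors the qemu process holds, so treat
it as a rough upper-bound illustration rather than confirmed librbd
behaviour.

  # Worst-case file descriptor estimate for one Qemu/KVM process, under
  # the (speculative) assumption of one TCP connection, i.e. one
  # descriptor, per attached RBD volume per OSD it touches. Other
  # descriptors the process needs (disk images, sockets, ...) are
  # ignored, so the real headroom is somewhat smaller.

  def worst_case_fds(num_osds, num_volumes):
      return num_osds * num_volumes

  NOFILE_DEFAULT = 1024  # typical default "ulimit -n" for the process

  print(worst_case_fds(20, 50))    # 1000: just under 1024, hence "about 50 volumes"
  print(worst_case_fds(124, 10))   # 1240: already over the default limit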